The entire process of predicting the Hotel Occupancy Rate (TPK BPS) was carried out in Jupyter Notebook 6.1.6 on Python 3.8.2 x64 for Windows.
These are the libraries I used in this competition:
- Pandas, for the data processing using table-like form
- Numpy, for the data processing using array-like form
- Scikit-learn, for the machine learning tasks
- Plotly, for data graphing
- Matplotlib, for data plotting
The committee only gave us the Daily Hotel Occupancy Rate retrieved online (tpk_harian) as the X variable and the Monthly Hotel Occupancy Rate published by BPS (tpk_bps) as the Y variable. The lack of data encouraged me to look for other sources, as follows:
- covid_harian_aktif = daily covid active cases, retrieved from KawalCovid19
- covid_harian = daily new covid cases, retrieved from KawalCovid19
- covid_total = total cases of covid at the end of the month (last day), retrieved from KawalCovid19
- penerbangan = the number of flight passengers to Bali, retrieved from bali.bps.go.id
- wisatawan = the number of domestic tourists coming to Bali, retrieved from bps.go.id and from Contact Person from Disparda Bali (Dinas Pariwisata Bali)
- wisatawan_mancanegara = the number of foreign tourists coming to Bali, retrieved from bali.bps.go.id
- tpk_bps_arima = the Monthly Hotel Occupancy Rate published by BPS (tpk_bps) shifted back one month (y-1)
- hari = the number of days in a month
- mobility = Google Mobility index for INDONESIA (not Bali specifically), retrieved from OurWorldInData.org
I have therefore kept the initial pre-processing code for the variables that turned out to be insignificant. However, in model.fit and model.predict, only the significant variables are used as the final predictors.
As for the models, here are the ones I tried running on the data:
- Linear Regression
- Ridge Regression
- Random Forest Regressor
- Support Vector Regression (SVR)
- K-Nearest Neighbor Regressor
- MLPRegressor (Neural Network Regression)
- Lasso Regression
- Decision Tree Regressor
Out of all eight, the ones that consistently achieve the lowest RMSE are Random Forest, Lasso, and SVR. Random Forest and Lasso work best with more independent variables, but their RMSE values are still higher than those of SVR with fewer independent variables.
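To make the comparison concrete, here is a minimal sketch of how such an RMSE comparison can be run with scikit-learn's cross_val_score. The data below is synthetic (the real features are built later in this notebook) and the hyperparameters are placeholders, not the tuned competition values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the monthly features (the real data has only 12 rows)
rng = np.random.RandomState(0)
X = rng.rand(40, 3)
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=40)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.01),
    "rf": RandomForestRegressor(n_estimators=50, random_state=0),
    "svr": make_pipeline(StandardScaler(), SVR(C=10.0)),
}

rmse = {}
for name, model in models.items():
    # neg_root_mean_squared_error returns negative RMSE; flip the sign
    scores = cross_val_score(model, X, y, cv=4,
                             scoring="neg_root_mean_squared_error")
    rmse[name] = -scores.mean()

for name, value in sorted(rmse.items(), key=lambda kv: kv[1]):
    print(f"{name}: {value:.3f}")
```

With only 12 real months of data, cross-validated RMSE like this is noisy, which is why the "frequently appears lowest" phrasing above is the honest summary.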
# Module installation, if needed
# ! pip install pandas
# ! pip install scikit-learn
# ! pip install plotly
# ! pip install matplotlib
# ! pip install voila
# Import modules and give them aliases
# Modules for data reading
import pandas as pd
import sklearn as sk
import numpy as np
# Modules for data graphing
import plotly.graph_objects as go
import matplotlib.pyplot as plt
# Modules for model building
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
# Modules for model testing
import math
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score, precision_score, recall_score,f1_score, confusion_matrix, classification_report
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.metrics import mean_absolute_error
# Reading CSV Files, the data used in this project
# Data for Dependent Variable
tpk_bps = pd.read_csv('Datasets/train-TPK_Hotel_berbintang_2020.csv') # from the committee
# Data for Independent Variables
tpk_harian = pd.read_csv('Datasets/train-online_booking_2020.csv') # from the committee
penerbangan = pd.read_csv('Datasets/train-penerbangan_2020.csv') # number of flight passengers, from BPS website
wisatawan = pd.read_csv('Datasets/train-wisatawan_domestik_2020.csv') # from Disparda Bali
# Insignificant independent variables, hence excluded from the final model
covid_harian_aktif = pd.read_csv('Datasets/train-covid_cases_bali_2020.csv') # from KawalCovid19
covid_harian = pd.read_csv('Datasets/train-covid_cases_bali_harian_2020.csv') # from KawalCovid19
covid_total = pd.read_csv('Datasets/train-covid_cases_bali_total_2020.csv') # from KawalCovid19
wisatawan_mancanegara = pd.read_csv('Datasets/train-wisatawan_mancanegara_2020.csv') # from Disparda Bali
tpk_bps_arima = pd.read_csv('Datasets/train-TPK_Hotel_berbintang_2020_plus_des_2019.csv') # y-1
hari = pd.read_csv('Datasets/train-hari_2020.csv') # number of days in a month
mobility = pd.read_csv('Datasets/train-google_mobility_2020.csv') # historical google mobility index
# Remove possible NaN values from the daily historical data
# (assign the result back: dropna() is not in-place by default)
tpk_harian = tpk_harian.dropna()
covid_harian = covid_harian.dropna()
covid_harian_aktif = covid_harian_aktif.dropna()
| | tanggal | covid_bali |
|---|---|---|
| 0 | 1/1/2020 | 1 |
| 1 | 2/1/2020 | 1 |
| 2 | 3/21/2020 | 2 |
| 3 | 3/22/2020 | 1 |
| 4 | 3/23/2020 | 4 |
| ... | ... | ... |
| 283 | 12/27/2020 | 894 |
| 284 | 12/28/2020 | 897 |
| 285 | 12/29/2020 | 926 |
| 286 | 12/30/2020 | 967 |
| 287 | 12/31/2020 | 1043 |
288 rows × 2 columns
# This function is retrieved from https://gist.github.com/Xylambda/b8f38dce74dd3d54ff906eebfe560ac0
# It uses the Fast Fourier Transform to denoise the possibly noisy data in covid_harian, covid_harian_aktif and tpk_harian
def fft_denoiser(x, n_components, to_real=True):
    """Fast Fourier transform denoiser.

    Denoises data using the fast Fourier transform.

    Parameters
    ----------
    x : numpy.array
        The data to denoise.
    n_components : int
        Power-spectrum threshold: coefficients whose power falls
        below this value are zeroed out.
    to_real : bool, optional, default: True
        Whether to keep only the real part (True) or not (False).

    Returns
    -------
    clean_data : numpy.array
        The denoised data.

    References
    ----------
    .. [1] Steve Brunton - Denoising Data with FFT [Python]
       https://www.youtube.com/watch?v=s2K1JfNR7Sc&ab_channel=SteveBrunton
    """
    n = len(x)
    # compute the fft
    fft = np.fft.fft(x, n)
    # compute the power spectral density:
    # the squared magnitude of each fft coefficient
    PSD = fft * np.conj(fft) / n
    # keep only the coefficients whose power exceeds the threshold
    _mask = PSD > n_components
    fft = _mask * fft
    # inverse fourier transform
    clean_data = np.fft.ifft(fft)
    if to_real:
        clean_data = clean_data.real
    return clean_data
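As a quick sanity check of this denoiser, the sketch below (self-contained, restating the same logic compactly) applies it to a synthetic noisy sine wave; the threshold of 10 is an arbitrary choice for this toy signal.

```python
import numpy as np

def fft_denoiser(x, n_components, to_real=True):
    # Same logic as the function above: zero out FFT coefficients
    # whose power-spectrum value falls below n_components
    n = len(x)
    fft = np.fft.fft(x, n)
    PSD = fft * np.conj(fft) / n
    fft = (PSD > n_components) * fft
    clean = np.fft.ifft(fft)
    return clean.real if to_real else clean

# A clean sine plus white noise; denoising should recover the sine closely,
# because the sine's single frequency carries far more power than any noise bin
t = np.linspace(0, 1, 500, endpoint=False)
signal = np.sin(2 * np.pi * 5 * t)
noisy = signal + np.random.RandomState(0).normal(scale=0.3, size=t.size)
denoised = fft_denoiser(noisy, n_components=10)

print(np.abs(noisy - signal).mean(), np.abs(denoised - signal).mean())
```

Note that despite the parameter name, n_components acts as a power threshold, not a count of components, so a good value depends on the scale of the data.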
The following code for HANDLING OUTLIERS and CALCULATING MONTHLY DATA on covid_harian is applied in the same way to covid_harian_aktif.
# Calculate the upper quantile of covid_harian
max_threshold = covid_harian['covid_bali'].quantile(0.90)
max_threshold
127.19999999999993
# Remove outliers above the upper quantile
# (.copy() avoids the SettingWithCopyWarning when adding columns later)
covid_harian = covid_harian[covid_harian['covid_bali'] < max_threshold].copy()
# Display data after outliers removal
covid_harian
| | tanggal | covid_bali |
|---|---|---|
| 0 | 1/1/2020 | 0 |
| 1 | 2/1/2020 | 0 |
| 2 | 3/15/2020 | 0 |
| 3 | 3/16/2020 | 0 |
| 4 | 3/17/2020 | 0 |
| ... | ... | ... |
| 285 | 12/23/2020 | 122 |
| 286 | 12/24/2020 | 123 |
| 287 | 12/25/2020 | 112 |
| 288 | 12/26/2020 | 96 |
| 289 | 12/27/2020 | 66 |
264 rows × 2 columns
# Denoise the possible noisy data covid_harian
covid_harian['covid_bali_denoised_fft'] = fft_denoiser(covid_harian['covid_bali'], 10, to_real=True)
covid_harian['covid_bali_denoised_fft']
0 0.277043
1 -0.063758
2 0.184509
3 -0.206951
4 0.199587
...
285 121.451228
286 123.339444
287 112.071488
288 96.216400
289 65.455154
Name: covid_bali_denoised_fft, Length: 264, dtype: float64
# Display data after denoising
covid_harian
| | tanggal | covid_bali | covid_bali_denoised_fft |
|---|---|---|---|
| 0 | 1/1/2020 | 0 | 0.277043 |
| 1 | 2/1/2020 | 0 | -0.063758 |
| 2 | 3/15/2020 | 0 | 0.184509 |
| 3 | 3/16/2020 | 0 | -0.206951 |
| 4 | 3/17/2020 | 0 | 0.199587 |
| ... | ... | ... | ... |
| 285 | 12/23/2020 | 122 | 121.451228 |
| 286 | 12/24/2020 | 123 | 123.339444 |
| 287 | 12/25/2020 | 112 | 112.071488 |
| 288 | 12/26/2020 | 96 | 96.216400 |
| 289 | 12/27/2020 | 66 | 65.455154 |
264 rows × 3 columns
# Show information on data types
covid_harian.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 264 entries, 0 to 289
Data columns (total 3 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   tanggal                  264 non-null    object
 1   covid_bali               264 non-null    int64
 2   covid_bali_denoised_fft  264 non-null    float64
dtypes: float64(1), int64(1), object(1)
memory usage: 8.2+ KB
# Change the datatype of variable 'tanggal' into datetime (YEAR-MONTH-DATE)
covid_harian['tanggal'] = pd.to_datetime(covid_harian['tanggal'], format="%m/%d/%Y")
covid_harian.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 264 entries, 0 to 289
Data columns (total 3 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   tanggal                  264 non-null    datetime64[ns]
 1   covid_bali               264 non-null    int64
 2   covid_bali_denoised_fft  264 non-null    float64
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 8.2 KB
# Prepare variable 'tanggal' as index for grouping the data based on month (the following code)
covid_harian = covid_harian.set_index('tanggal')
# Make a new data frame, named covid_harian_agg
# which groups the data into months using the median on the data, and 'tanggal' as the index
covid_harian_agg = covid_harian.groupby(pd.Grouper(freq='M')).median()
# Reset the index on new data frame and show the datatypes
covid_harian_agg = covid_harian_agg.reset_index()
covid_harian_agg.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 3 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   tanggal                  12 non-null     datetime64[ns]
 1   covid_bali               12 non-null     float64
 2   covid_bali_denoised_fft  12 non-null     float64
dtypes: datetime64[ns](1), float64(2)
memory usage: 416.0 bytes
# Display the new data frame
covid_harian_agg
| | tanggal | covid_bali | covid_bali_denoised_fft |
|---|---|---|---|
| 0 | 2020-01-31 | 0.0 | 0.277043 |
| 1 | 2020-02-29 | 0.0 | -0.063758 |
| 2 | 2020-03-31 | 0.0 | 0.190967 |
| 3 | 2020-04-30 | 6.0 | 6.245022 |
| 4 | 2020-05-31 | 6.0 | 5.772174 |
| 5 | 2020-06-30 | 31.0 | 30.962186 |
| 6 | 2020-07-31 | 61.0 | 60.658668 |
| 7 | 2020-08-31 | 48.5 | 48.819108 |
| 8 | 2020-09-30 | 86.0 | 86.071488 |
| 9 | 2020-10-31 | 87.0 | 87.454587 |
| 10 | 2020-11-30 | 68.5 | 68.866959 |
| 11 | 2020-12-31 | 102.0 | 102.536645 |
The original data does not include tpk_online; it is calculated below from the available rooms and the total rooms in the data.
# Show information on data types
tpk_harian.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 696 entries, 0 to 695
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   province            696 non-null    object
 1   type                696 non-null    object
 2   tanggal             696 non-null    object
 3   klasifikasi         696 non-null    object
 4   all_available_room  696 non-null    int64
 5   room_total          696 non-null    int64
dtypes: int64(2), object(4)
memory usage: 32.8+ KB
# Change the datatype of variable 'tanggal' into datetime (YEAR-MONTH-DATE)
tpk_harian['tanggal'] = pd.to_datetime(tpk_harian['tanggal'], format="%m/%d/%Y")
# Show first five rows of the data frame
tpk_harian.head()
| | province | type | tanggal | klasifikasi | all_available_room | room_total |
|---|---|---|---|---|---|---|
| 0 | Bali | Hotel | 2020-01-01 | Rated | 2155 | 37756 |
| 1 | Bali | Hotel | 2020-01-01 | Not Rated | 163 | 392 |
| 2 | Bali | Hotel | 2020-01-02 | Rated | 3878 | 38814 |
| 3 | Bali | Hotel | 2020-01-02 | Not Rated | 211 | 415 |
| 4 | Bali | Hotel | 2020-01-03 | Rated | 5412 | 39084 |
# Show information on data types
tpk_harian.dtypes
province                       object
type                           object
tanggal                datetime64[ns]
klasifikasi                    object
all_available_room              int64
room_total                      int64
dtype: object
# Calculate used/booked rooms on tpk_harian
# the total of rooms minus the available rooms on that day (not booked)
tpk_harian['used_room'] = tpk_harian.room_total-tpk_harian.all_available_room
# Calculate the value of tpk_online (the percentage
# of booked rooms out of the total rooms)
tpk_harian['tpk_online'] = tpk_harian.used_room/tpk_harian.room_total*100
# Show tpk_harian with the new columns used_room and tpk_online added
tpk_harian
| | province | type | tanggal | klasifikasi | all_available_room | room_total | used_room | tpk_online |
|---|---|---|---|---|---|---|---|---|
| 0 | Bali | Hotel | 2020-01-01 | Rated | 2155 | 37756 | 35601 | 94.292298 |
| 1 | Bali | Hotel | 2020-01-01 | Not Rated | 163 | 392 | 229 | 58.418367 |
| 2 | Bali | Hotel | 2020-01-02 | Rated | 3878 | 38814 | 34936 | 90.008760 |
| 3 | Bali | Hotel | 2020-01-02 | Not Rated | 211 | 415 | 204 | 49.156627 |
| 4 | Bali | Hotel | 2020-01-03 | Rated | 5412 | 39084 | 33672 | 86.152901 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 691 | Bali | Hotel | 2020-12-29 | Not Rated | 475 | 568 | 93 | 16.373239 |
| 692 | Bali | Hotel | 2020-12-30 | Rated | 24240 | 45600 | 21360 | 46.842105 |
| 693 | Bali | Hotel | 2020-12-30 | Not Rated | 427 | 551 | 124 | 22.504537 |
| 694 | Bali | Hotel | 2020-12-31 | Rated | 26194 | 46764 | 20570 | 43.986827 |
| 695 | Bali | Hotel | 2020-12-31 | Not Rated | 441 | 523 | 82 | 15.678776 |
696 rows × 8 columns
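The occupancy-rate arithmetic above can be verified on a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical mini version of tpk_harian to sanity-check the formula
df = pd.DataFrame({"all_available_room": [25, 100], "room_total": [100, 400]})
df["used_room"] = df.room_total - df.all_available_room
df["tpk_online"] = df.used_room / df.room_total * 100

print(df)
```

Both toy rows have 75% of rooms booked, confirming the percentage is computed against room_total, not against the available rooms.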
# Calculate the upper quantile of the data
max_threshold = tpk_harian['tpk_online'].quantile(0.90)
max_threshold
50.37341670604346
# Calculate the lower quantile of the data
min_threshold = tpk_harian['tpk_online'].quantile(0.10)
min_threshold
13.100003163420439
# Remove outliers outside the upper and lower quantiles
# (.copy() avoids the SettingWithCopyWarning when adding columns later)
tpk_harian = tpk_harian[(tpk_harian['tpk_online'] < max_threshold) & (tpk_harian['tpk_online'] > min_threshold)].copy()
# Denoise the possible noisy data
tpk_harian['tpk_online_denoised_fft'] = fft_denoiser(tpk_harian['tpk_online'], 10, to_real=True)
tpk_harian['tpk_online_denoised_fft']
3 48.018343
5 42.963230
7 37.568041
9 35.800638
11 34.379302
...
691 16.033055
692 47.276324
693 23.166606
694 44.618422
695 15.458028
Name: tpk_online_denoised_fft, Length: 556, dtype: float64
# Prepare variable 'tanggal' as index for grouping the data based on month (the following code)
tpk_harian = tpk_harian.set_index('tanggal')
# Make a new data frame, named tpk_harian_agg,
# which groups the data into months using the median, with 'tanggal' as the index
tpk_harian_agg = tpk_harian.groupby(pd.Grouper(freq='M')).median()
The median is used for now; if another aggregation method seems more representative, it can be tried here.
# Reset the index on new data frame and show the datatypes
tpk_harian_agg = tpk_harian_agg.reset_index()
tpk_harian_agg
| | tanggal | all_available_room | room_total | used_room | tpk_online | tpk_online_denoised_fft |
|---|---|---|---|---|---|---|
| 0 | 2020-01-31 | 459.0 | 766.0 | 279.0 | 37.106918 | 37.433877 |
| 1 | 2020-02-29 | 433.5 | 713.0 | 284.0 | 39.448381 | 39.570870 |
| 2 | 2020-03-31 | 629.5 | 758.0 | 186.0 | 24.497126 | 25.584338 |
| 3 | 2020-04-30 | 21190.0 | 33285.0 | 9520.0 | 22.779846 | 22.753135 |
| 4 | 2020-05-31 | 16659.0 | 32539.0 | 15707.0 | 30.792608 | 31.062169 |
| 5 | 2020-06-30 | 33407.0 | 50624.0 | 16181.0 | 32.457119 | 32.635825 |
| 6 | 2020-07-31 | 34482.0 | 51258.0 | 15414.0 | 30.432621 | 30.308360 |
| 7 | 2020-08-31 | 2274.0 | 3793.0 | 1519.0 | 32.988680 | 32.097701 |
| 8 | 2020-09-30 | 29066.0 | 49852.0 | 19863.0 | 37.913000 | 37.222877 |
| 9 | 2020-10-31 | 29403.5 | 49999.5 | 18812.0 | 35.942146 | 35.518501 |
| 10 | 2020-11-30 | 27918.0 | 48904.0 | 19868.0 | 39.083442 | 38.770749 |
| 11 | 2020-12-31 | 522.0 | 650.0 | 146.0 | 23.817292 | 24.476106 |
# Show shape (dimensions) of the data frame
tpk_harian_agg.shape
(12, 6)
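Since the note above suggests trying other aggregation methods, here is a minimal sketch of swapping the aggregator under pd.Grouper, on hypothetical daily data (the date range and values are made up for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical daily series covering two months
idx = pd.date_range("2020-01-01", "2020-02-29", freq="D")
daily = pd.DataFrame({"tpk_online": np.linspace(20, 40, len(idx))}, index=idx)

# Month-end grouping, as in the cells above; swap .median() for .mean()
# (or .quantile(0.75), etc.) to try other aggregations
monthly_median = daily.groupby(pd.Grouper(freq="M")).median()
monthly_mean = daily.groupby(pd.Grouper(freq="M")).mean()

print(monthly_median.join(monthly_mean, rsuffix="_mean"))
```

For skewed daily data such as tpk_online with outliers, the median is the more robust choice, which is presumably why it was kept here.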
# Show information on data types
mobility.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 320 entries, 0 to 319
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   entity                 320 non-null    object
 1   code                   320 non-null    object
 2   day                    320 non-null    object
 3   retail_and_recreation  320 non-null    float64
 4   grocery_and_pharmacy   320 non-null    float64
 5   residential            320 non-null    float64
 6   transit_stations       320 non-null    float64
 7   parks                  320 non-null    float64
 8   workplaces             320 non-null    float64
dtypes: float64(6), object(3)
memory usage: 22.6+ KB
# Change the datatype of variable 'day' into datetime (YEAR-MONTH-DATE)
mobility['day'] = pd.to_datetime(mobility['day'], format="%m/%d/%Y")
mobility.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 320 entries, 0 to 319
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   entity                 320 non-null    object
 1   code                   320 non-null    object
 2   day                    320 non-null    datetime64[ns]
 3   retail_and_recreation  320 non-null    float64
 4   grocery_and_pharmacy   320 non-null    float64
 5   residential            320 non-null    float64
 6   transit_stations       320 non-null    float64
 7   parks                  320 non-null    float64
 8   workplaces             320 non-null    float64
dtypes: datetime64[ns](1), float64(6), object(2)
memory usage: 22.6+ KB
# Prepare variable 'day' as index for grouping the data based on month (the following code)
mobility = mobility.set_index('day')
# Make a new data frame, named mobility_agg
# which groups the data into months using the median on the data, and 'day' as the index
mobility_agg = mobility.groupby(pd.Grouper(freq='M')).median()
# Reset the index on new data frame and show the datatypes
mobility_agg = mobility_agg.reset_index()
mobility_agg.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   day                    12 non-null     datetime64[ns]
 1   retail_and_recreation  12 non-null     float64
 2   grocery_and_pharmacy   12 non-null     float64
 3   residential            12 non-null     float64
 4   transit_stations       12 non-null     float64
 5   parks                  12 non-null     float64
 6   workplaces             12 non-null     float64
dtypes: datetime64[ns](1), float64(6)
memory usage: 800.0 bytes
# Display the new data frame
mobility_agg
| | day | retail_and_recreation | grocery_and_pharmacy | residential | transit_stations | parks | workplaces |
|---|---|---|---|---|---|---|---|
| 0 | 2020-01-31 | 0.000 | 0.0000 | 0.0000 | 0.000 | 0.0000 | 0.0000 |
| 1 | 2020-02-29 | -2.286 | -2.1430 | 1.0000 | -0.571 | -4.5710 | 2.8570 |
| 2 | 2020-03-31 | -0.857 | 2.8570 | 1.8570 | -3.571 | -5.1430 | 2.7140 |
| 3 | 2020-04-30 | -37.571 | -20.7855 | 16.2860 | -56.571 | -35.7140 | -31.6430 |
| 4 | 2020-05-31 | -38.714 | -13.8570 | 17.0000 | -55.143 | -37.7140 | -34.8570 |
| 5 | 2020-06-30 | -25.000 | -7.0715 | 12.6425 | -43.857 | -23.4285 | -19.7855 |
| 6 | 2020-07-31 | -18.000 | -3.4290 | 11.1430 | -36.000 | -17.4290 | -18.1430 |
| 7 | 2020-08-31 | -14.286 | 0.0000 | 10.2860 | -32.143 | -8.1430 | -22.5710 |
| 8 | 2020-09-30 | -15.000 | -2.5000 | 10.5000 | -35.000 | -8.4290 | -17.7855 |
| 9 | 2020-10-31 | -18.714 | -0.5710 | 10.0000 | -33.857 | -9.0000 | -20.5710 |
| 10 | 2020-11-30 | -15.857 | -0.3575 | 7.4290 | -29.571 | -11.2860 | -21.4290 |
| 11 | 2020-12-31 | -15.000 | 3.0000 | 10.0000 | -25.571 | -11.4290 | -21.1430 |
# Add a new column named 'Id' on the data frame tpk_harian_agg,
# value ranges from 1 to 12
# Then display the data
tpk_harian_agg['Id'] = np.arange(start=1, stop=13, step=1)
tpk_harian_agg
| | tanggal | all_available_room | room_total | used_room | tpk_online | tpk_online_denoised_fft | Id |
|---|---|---|---|---|---|---|---|
| 0 | 2020-01-31 | 459.0 | 766.0 | 279.0 | 37.106918 | 37.433877 | 1 |
| 1 | 2020-02-29 | 433.5 | 713.0 | 284.0 | 39.448381 | 39.570870 | 2 |
| 2 | 2020-03-31 | 629.5 | 758.0 | 186.0 | 24.497126 | 25.584338 | 3 |
| 3 | 2020-04-30 | 21190.0 | 33285.0 | 9520.0 | 22.779846 | 22.753135 | 4 |
| 4 | 2020-05-31 | 16659.0 | 32539.0 | 15707.0 | 30.792608 | 31.062169 | 5 |
| 5 | 2020-06-30 | 33407.0 | 50624.0 | 16181.0 | 32.457119 | 32.635825 | 6 |
| 6 | 2020-07-31 | 34482.0 | 51258.0 | 15414.0 | 30.432621 | 30.308360 | 7 |
| 7 | 2020-08-31 | 2274.0 | 3793.0 | 1519.0 | 32.988680 | 32.097701 | 8 |
| 8 | 2020-09-30 | 29066.0 | 49852.0 | 19863.0 | 37.913000 | 37.222877 | 9 |
| 9 | 2020-10-31 | 29403.5 | 49999.5 | 18812.0 | 35.942146 | 35.518501 | 10 |
| 10 | 2020-11-30 | 27918.0 | 48904.0 | 19868.0 | 39.083442 | 38.770749 | 11 |
| 11 | 2020-12-31 | 522.0 | 650.0 | 146.0 | 23.817292 | 24.476106 | 12 |
# Add a new column named 'Id' on the data frame covid_harian_agg
# value ranges from 1 to 12
# Then display the data
covid_harian_agg['Id'] = np.arange(start=1, stop=13, step=1)
covid_harian_agg
| | tanggal | covid_bali | covid_bali_denoised_fft | Id |
|---|---|---|---|---|
| 0 | 2020-01-31 | 0.0 | 0.277043 | 1 |
| 1 | 2020-02-29 | 0.0 | -0.063758 | 2 |
| 2 | 2020-03-31 | 0.0 | 0.190967 | 3 |
| 3 | 2020-04-30 | 6.0 | 6.245022 | 4 |
| 4 | 2020-05-31 | 6.0 | 5.772174 | 5 |
| 5 | 2020-06-30 | 31.0 | 30.962186 | 6 |
| 6 | 2020-07-31 | 61.0 | 60.658668 | 7 |
| 7 | 2020-08-31 | 48.5 | 48.819108 | 8 |
| 8 | 2020-09-30 | 86.0 | 86.071488 | 9 |
| 9 | 2020-10-31 | 87.0 | 87.454587 | 10 |
| 10 | 2020-11-30 | 68.5 | 68.866959 | 11 |
| 11 | 2020-12-31 | 102.0 | 102.536645 | 12 |
# Add a new column named 'Id' on the data frame penerbangan
# value ranges from 1 to 12
# Then display the data
penerbangan['Id'] = np.arange(start=1, stop=13, step=1)
penerbangan
| | tanggal_ter | penerbangan | Id |
|---|---|---|---|
| 0 | 1-Jan-20 | 1094169 | 1 |
| 1 | 1-Feb-20 | 772595 | 2 |
| 2 | 1-Mar-20 | 527776 | 3 |
| 3 | 1-Apr-20 | 50874 | 4 |
| 4 | 1-May-20 | 4047 | 5 |
| 5 | 1-Jun-20 | 12273 | 6 |
| 6 | 1-Jul-20 | 43492 | 7 |
| 7 | 1-Aug-20 | 84721 | 8 |
| 8 | 1-Sep-20 | 81321 | 9 |
| 9 | 1-Oct-20 | 99562 | 10 |
| 10 | 1-Nov-20 | 169895 | 11 |
| 11 | 1-Dec-20 | 189485 | 12 |
# Add a new column named 'Id' on the data frame wisatawan
# value ranges from 1 to 12
# Then display the data
wisatawan['Id'] = np.arange(start=1, stop=13, step=1)
wisatawan
| | tanggal_wis | wisatawan | Id |
|---|---|---|---|
| 0 | 1-Jan-20 | 879702 | 1 |
| 1 | 1-Feb-20 | 721105 | 2 |
| 2 | 1-Mar-20 | 567452 | 3 |
| 3 | 1-Apr-20 | 175120 | 4 |
| 4 | 1-May-20 | 101948 | 5 |
| 5 | 1-Jun-20 | 137395 | 6 |
| 6 | 1-Jul-20 | 229112 | 7 |
| 7 | 1-Aug-20 | 355732 | 8 |
| 8 | 1-Sep-20 | 283349 | 9 |
| 9 | 1-Oct-20 | 337304 | 10 |
| 10 | 1-Nov-20 | 425097 | 11 |
| 11 | 1-Dec-20 | 382841 | 12 |
# Add a new column named 'Id' on the data frame tpk_bps_arima
# value ranges from 1 to 12
# Then display the data
tpk_bps_arima['Id'] = np.arange(start=1, stop=13, step=1)
tpk_bps_arima
| | tanggal | TPK_arima | Id |
|---|---|---|---|
| 0 | 1-Dec-19 | 62.55 | 1 |
| 1 | 1-Jan-20 | 59.29 | 2 |
| 2 | 1-Feb-20 | 45.98 | 3 |
| 3 | 1-Mar-20 | 25.41 | 4 |
| 4 | 1-Apr-20 | 3.22 | 5 |
| 5 | 1-May-20 | 2.07 | 6 |
| 6 | 1-Jun-20 | 2.07 | 7 |
| 7 | 1-Jul-20 | 2.57 | 8 |
| 8 | 1-Aug-20 | 3.68 | 9 |
| 9 | 1-Sep-20 | 5.28 | 10 |
| 10 | 1-Oct-20 | 9.53 | 11 |
| 11 | 1-Nov-20 | 9.32 | 12 |
# Add a new column named 'Id' on the data frame hari
# value ranges from 1 to 12
# Then display the data
hari['Id'] = np.arange(start=1, stop=13, step=1)
hari
| | tanggal | hari | Id |
|---|---|---|---|
| 0 | 1-Jan-20 | 31 | 1 |
| 1 | 1-Feb-20 | 29 | 2 |
| 2 | 1-Mar-20 | 31 | 3 |
| 3 | 1-Apr-20 | 30 | 4 |
| 4 | 1-May-20 | 31 | 5 |
| 5 | 1-Jun-20 | 30 | 6 |
| 6 | 1-Jul-20 | 31 | 7 |
| 7 | 1-Aug-20 | 31 | 8 |
| 8 | 1-Sep-20 | 30 | 9 |
| 9 | 1-Oct-20 | 31 | 10 |
| 10 | 1-Nov-20 | 30 | 11 |
| 11 | 1-Dec-20 | 31 | 12 |
# Add a new column named 'Id' on the data frame covid_total
# value ranges from 1 to 12
# Then display the data
covid_total['Id'] = np.arange(start=1, stop=13, step=1)
covid_total
| | tanggal | covid_bali_total | Id |
|---|---|---|---|
| 0 | 1/31/2020 | 0 | 1 |
| 1 | 2/29/2020 | 0 | 2 |
| 2 | 3/31/2020 | 19 | 3 |
| 3 | 4/30/2020 | 222 | 4 |
| 4 | 5/31/2020 | 465 | 5 |
| 5 | 6/30/2020 | 1493 | 6 |
| 6 | 7/31/2020 | 3407 | 7 |
| 7 | 8/31/2020 | 5207 | 8 |
| 8 | 9/30/2020 | 8878 | 9 |
| 9 | 10/31/2020 | 11764 | 10 |
| 10 | 11/30/2020 | 13879 | 11 |
| 11 | 12/31/2020 | 17593 | 12 |
# Add a new column named 'Id' on the data frame mobility_agg
# value ranges from 1 to 12
# Then display the data
mobility_agg['Id'] = np.arange(start=1, stop=13, step=1)
mobility_agg
| | day | retail_and_recreation | grocery_and_pharmacy | residential | transit_stations | parks | workplaces | Id |
|---|---|---|---|---|---|---|---|---|
| 0 | 2020-01-31 | 0.000 | 0.0000 | 0.0000 | 0.000 | 0.0000 | 0.0000 | 1 |
| 1 | 2020-02-29 | -2.286 | -2.1430 | 1.0000 | -0.571 | -4.5710 | 2.8570 | 2 |
| 2 | 2020-03-31 | -0.857 | 2.8570 | 1.8570 | -3.571 | -5.1430 | 2.7140 | 3 |
| 3 | 2020-04-30 | -37.571 | -20.7855 | 16.2860 | -56.571 | -35.7140 | -31.6430 | 4 |
| 4 | 2020-05-31 | -38.714 | -13.8570 | 17.0000 | -55.143 | -37.7140 | -34.8570 | 5 |
| 5 | 2020-06-30 | -25.000 | -7.0715 | 12.6425 | -43.857 | -23.4285 | -19.7855 | 6 |
| 6 | 2020-07-31 | -18.000 | -3.4290 | 11.1430 | -36.000 | -17.4290 | -18.1430 | 7 |
| 7 | 2020-08-31 | -14.286 | 0.0000 | 10.2860 | -32.143 | -8.1430 | -22.5710 | 8 |
| 8 | 2020-09-30 | -15.000 | -2.5000 | 10.5000 | -35.000 | -8.4290 | -17.7855 | 9 |
| 9 | 2020-10-31 | -18.714 | -0.5710 | 10.0000 | -33.857 | -9.0000 | -20.5710 | 10 |
| 10 | 2020-11-30 | -15.857 | -0.3575 | 7.4290 | -29.571 | -11.2860 | -21.4290 | 11 |
| 11 | 2020-12-31 | -15.000 | 3.0000 | 10.0000 | -25.571 | -11.4290 | -21.1430 | 12 |
# Join all the data frames (used or not used in the final models)
# based on the column ID into a new data frame named tpk_join.
# And display the first five rows of the data.
tpk_join = pd.merge(tpk_bps,tpk_harian_agg,on='Id',how='left')
tpk_join = pd.merge(tpk_join,covid_harian_agg,on='Id',how='left')
tpk_join = pd.merge(tpk_join,penerbangan,on='Id',how='left')
tpk_join = pd.merge(tpk_join,wisatawan,on='Id',how='left')
tpk_join = pd.merge(tpk_join,tpk_bps_arima,on='Id',how='left')
tpk_join = pd.merge(tpk_join,hari,on='Id',how='left')
tpk_join = pd.merge(tpk_join,covid_total,on='Id',how='left')
tpk_join = pd.merge(tpk_join,mobility_agg,on='Id',how='left')
tpk_join.head()
| | Id | Provinsi | Tahun | Bulan | Aggregate_var | TPK | tanggal_x | all_available_room | room_total | used_room | ... | hari | tanggal | covid_bali_total | day | retail_and_recreation | grocery_and_pharmacy | residential | transit_stations | parks | workplaces |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bali | 2020 | Januari | NaN | 59.29 | 2020-01-31 | 459.0 | 766.0 | 279.0 | ... | 31 | 1/31/2020 | 0 | 2020-01-31 | 0.000 | 0.0000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 1 | 2 | Bali | 2020 | Februari | NaN | 45.98 | 2020-02-29 | 433.5 | 713.0 | 284.0 | ... | 29 | 2/29/2020 | 0 | 2020-02-29 | -2.286 | -2.1430 | 1.000 | -0.571 | -4.571 | 2.857 |
| 2 | 3 | Bali | 2020 | Maret | NaN | 25.41 | 2020-03-31 | 629.5 | 758.0 | 186.0 | ... | 31 | 3/31/2020 | 19 | 2020-03-31 | -0.857 | 2.8570 | 1.857 | -3.571 | -5.143 | 2.714 |
| 3 | 4 | Bali | 2020 | April | NaN | 3.22 | 2020-04-30 | 21190.0 | 33285.0 | 9520.0 | ... | 30 | 4/30/2020 | 222 | 2020-04-30 | -37.571 | -20.7855 | 16.286 | -56.571 | -35.714 | -31.643 |
| 4 | 5 | Bali | 2020 | Mei | NaN | 2.07 | 2020-05-31 | 16659.0 | 32539.0 | 15707.0 | ... | 31 | 5/31/2020 | 465 | 2020-05-31 | -38.714 | -13.8570 | 17.000 | -55.143 | -37.714 | -34.857 |
5 rows × 32 columns
# Delete (drop) unnecessary variables
tpk_join = tpk_join.drop(['tanggal_x', 'tanggal_y', 'tanggal', 'Aggregate_var', 'day', 'Provinsi'], axis=1)
# Display the first five rows of tpk_join
tpk_join.head()
| | Id | Tahun | Bulan | TPK | all_available_room | room_total | used_room | tpk_online | tpk_online_denoised_fft | covid_bali | ... | wisatawan | TPK_arima | hari | covid_bali_total | retail_and_recreation | grocery_and_pharmacy | residential | transit_stations | parks | workplaces |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2020 | Januari | 59.29 | 459.0 | 766.0 | 279.0 | 37.106918 | 37.433877 | 0.0 | ... | 879702 | 62.55 | 31 | 0 | 0.000 | 0.0000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 1 | 2 | 2020 | Februari | 45.98 | 433.5 | 713.0 | 284.0 | 39.448381 | 39.570870 | 0.0 | ... | 721105 | 59.29 | 29 | 0 | -2.286 | -2.1430 | 1.000 | -0.571 | -4.571 | 2.857 |
| 2 | 3 | 2020 | Maret | 25.41 | 629.5 | 758.0 | 186.0 | 24.497126 | 25.584338 | 0.0 | ... | 567452 | 45.98 | 31 | 19 | -0.857 | 2.8570 | 1.857 | -3.571 | -5.143 | 2.714 |
| 3 | 4 | 2020 | April | 3.22 | 21190.0 | 33285.0 | 9520.0 | 22.779846 | 22.753135 | 6.0 | ... | 175120 | 25.41 | 30 | 222 | -37.571 | -20.7855 | 16.286 | -56.571 | -35.714 | -31.643 |
| 4 | 5 | 2020 | Mei | 2.07 | 16659.0 | 32539.0 | 15707.0 | 30.792608 | 31.062169 | 6.0 | ... | 101948 | 3.22 | 31 | 465 | -38.714 | -13.8570 | 17.000 | -55.143 | -37.714 | -34.857 |
5 rows × 24 columns
# Save the resulting data frame tpk_join to a CSV file locally
tpk_join.to_csv('Results/tpk_join_v999991.csv')
This step aims to reveal the rough pattern of tpk_bps (the Y variable) against each independent variable. The plotting libraries imported at the beginning of this notebook are used for this purpose.
# Display scatter plot
fig = go.Figure()
fig.add_trace(go.Scatter(x=tpk_join['Id'], y=tpk_join['TPK'],
mode='lines+markers',
name='TPK BPS'))
fig.add_trace(go.Scatter(x=tpk_join['Id'], y=tpk_join['tpk_online_denoised_fft'],
mode='lines+markers',
name='TPK Online'))
fig.show()
# Display scatter plot
fig = go.Figure()
fig.add_trace(go.Scatter(x=tpk_join['Id'], y=tpk_join['TPK'],
mode='lines+markers',
name='TPK BPS'))
fig.add_trace(go.Scatter(x=tpk_join['Id'], y=tpk_join['covid_bali_denoised_fft'],
mode='lines+markers',
name='Covid Cases'))
fig.show()
# Display scatter plot
# the variable 'penerbangan' is divided by 10000 to make the graph easier to read
fig = go.Figure()
fig.add_trace(go.Scatter(x=tpk_join['Id'], y=tpk_join['TPK'],
mode='lines+markers',
name='TPK BPS'))
fig.add_trace(go.Scatter(x=tpk_join['Id'], y=tpk_join['penerbangan']/10000,
mode='lines+markers',
name='Flight Passengers'))
fig.show()
# Display scatter plot
# the variable 'wisatawan' is divided by 10000 to make the graph easier to read
fig = go.Figure()
fig.add_trace(go.Scatter(x=tpk_join['Id'], y=tpk_join['TPK'],
mode='lines+markers',
name='TPK BPS'))
fig.add_trace(go.Scatter(x=tpk_join['Id'], y=tpk_join['wisatawan']/10000,
mode='lines+markers',
name='Domestic Tourists'))
fig.show()
# Display scatter plot
fig = go.Figure()
fig.add_trace(go.Scatter(x=tpk_join['Id'], y=tpk_join['TPK'],
mode='lines+markers',
name='TPK BPS'))
fig.add_trace(go.Scatter(x=tpk_join['Id'], y=tpk_join['TPK_arima'],
mode='lines+markers',
name='TPK Arima'))
fig.show()
The following tests the correlation between the Y variable and each X variable.
# Choose X variables and Y variable from data frame tpk_join
tpk_bps = tpk_join['TPK']
tpk_online = tpk_join['tpk_online_denoised_fft']
penerbangan = tpk_join['penerbangan']
wisatawan = tpk_join['wisatawan']
# Insignificant variables
covid_case_bali = tpk_join['covid_bali_denoised_fft']
tpk_bps_arima = tpk_join['TPK_arima']
hari = tpk_join['hari']
mobility = tpk_join['retail_and_recreation']
# Correlation check
np.corrcoef(tpk_online, tpk_bps)
array([[1. , 0.33625855],
[0.33625855, 1. ]])
# Correlation check
np.corrcoef(penerbangan, tpk_bps)
array([[1. , 0.98720155],
[0.98720155, 1. ]])
# Correlation check
np.corrcoef(wisatawan, tpk_bps)
array([[1. , 0.94955001],
[0.94955001, 1. ]])
# Correlation check
np.corrcoef(covid_case_bali, tpk_bps)
array([[ 1. , -0.4234527],
[-0.4234527, 1. ]])
# Correlation check
np.corrcoef(tpk_bps_arima, tpk_bps)
array([[1. , 0.9124071],
[0.9124071, 1. ]])
# Correlation check
np.corrcoef(hari, tpk_bps)
array([[ 1. , -0.1290198],
[-0.1290198, 1. ]])
# Correlation check
np.corrcoef(mobility, tpk_bps)
array([[1. , 0.75190638],
[0.75190638, 1. ]])
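The repeated np.corrcoef cells above can be condensed into a single DataFrame.corr call. Below is a sketch on hypothetical stand-in data (the column names mirror the real frame, but the values are made up for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for tpk_join with a target and a few predictors
rng = np.random.RandomState(0)
df = pd.DataFrame({"TPK": rng.rand(12) * 60})
df["penerbangan"] = df["TPK"] * 1e4 + rng.normal(scale=5e3, size=12)  # strongly correlated
df["hari"] = rng.randint(28, 32, size=12)                             # essentially unrelated

# One call replaces the repeated np.corrcoef cells:
# correlation of every column with the target, sorted by absolute strength
corr_with_y = df.corr()["TPK"].drop("TPK").sort_values(key=abs, ascending=False)
print(corr_with_y)
```

On the real tpk_join this gives the same Pearson coefficients as the cells above, in one sorted table, which makes the significant-vs-insignificant split easier to justify.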
# Display scatter plots for variables with strong correlations (penerbangan, wisatawan, tpk_bps_arima)
plt.scatter(penerbangan, tpk_bps)
<matplotlib.collections.PathCollection at 0x186cb4a3640>
# Convert each series into a 2-D NumPy array (column vector)
datatpk = tpk_online.to_numpy().reshape(-1,1)
datacovid = covid_case_bali.to_numpy().reshape(-1,1)
dataterbang = penerbangan.to_numpy().reshape(-1,1)
datawisata = wisatawan.to_numpy().reshape(-1,1)
datatpkarima = tpk_bps_arima.to_numpy().reshape(-1,1)
datahari = hari.to_numpy().reshape(-1,1)
datamobility = mobility.to_numpy().reshape(-1,1)
# Combine the chosen variables as DataX
dataX = np.hstack([datatpk,datawisata,dataterbang])
dataX
# Columns: tpk_online_denoised_fft, wisatawan, penerbangan
array([[3.74338774e+01, 8.79702000e+05, 1.09416900e+06],
[3.95708699e+01, 7.21105000e+05, 7.72595000e+05],
[2.55843384e+01, 5.67452000e+05, 5.27776000e+05],
[2.27531346e+01, 1.75120000e+05, 5.08740000e+04],
[3.10621692e+01, 1.01948000e+05, 4.04700000e+03],
[3.26358252e+01, 1.37395000e+05, 1.22730000e+04],
[3.03083601e+01, 2.29112000e+05, 4.34920000e+04],
[3.20977009e+01, 3.55732000e+05, 8.47210000e+04],
[3.72228774e+01, 2.83349000e+05, 8.13210000e+04],
[3.55185007e+01, 3.37304000e+05, 9.95620000e+04],
[3.87707490e+01, 4.25097000e+05, 1.69895000e+05],
[2.44761062e+01, 3.82841000e+05, 1.89485000e+05]])
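As a side note, stacking `(n, 1)` columns with `np.hstack` is equivalent to calling `np.column_stack` on the original 1-D arrays, which skips the reshape step; a small self-contained check:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])
c = np.array([100.0, 200.0, 300.0])

# hstack of reshaped (n, 1) columns...
X1 = np.hstack([a.reshape(-1, 1), b.reshape(-1, 1), c.reshape(-1, 1)])
# ...equals column_stack of the 1-D arrays directly
X2 = np.column_stack([a, b, c])

assert np.array_equal(X1, X2)
print(X1.shape)  # (3, 3)
```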
# Read CSV file
tpk_harian_test = pd.read_csv('Datasets/test-online_booking_2021.csv')
# Remove NaN values (dropna returns a new frame, so reassign it)
tpk_harian_test = tpk_harian_test.dropna()
# Change the datatype of variable 'tanggal' into datetime (YEAR-MONTH-DATE)
tpk_harian_test['tanggal'] = pd.to_datetime(tpk_harian_test['tanggal'], format="%m/%d/%Y")
# Calculate used_room from total rooms and available rooms,
# and use it to calculate tpk_online (the used-room percentage)
tpk_harian_test['used_room'] = tpk_harian_test['room_total']-tpk_harian_test['all_available_room']
tpk_harian_test['tpk_online'] = tpk_harian_test['used_room'] / tpk_harian_test['room_total']*100
# Denoise the possibly noisy tpk_online data
tpk_harian_test['tpk_online_denoised_fft'] = fft_denoiser(tpk_harian_test['tpk_online'], 10, to_real=True)
tpk_harian_test['tpk_online_denoised_fft']
0 40.222881
1 11.626967
2 39.564663
3 10.403010
4 40.100590
...
351 13.441035
352 40.089164
353 14.315314
354 37.725997
355 13.463850
Name: tpk_online_denoised_fft, Length: 356, dtype: float64
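`fft_denoiser` is defined earlier in the notebook; for readers skimming this section, a hypothetical low-pass denoiser of the same shape might look like the sketch below. The function name and the power-threshold rule are assumptions for illustration, not the notebook's exact implementation.

```python
import numpy as np

def fft_denoiser_sketch(x, threshold, to_real=True):
    # Hypothetical stand-in for the notebook's fft_denoiser:
    # zero out frequency components whose power falls below `threshold`
    x = np.asarray(x, dtype=float)
    n = len(x)
    fhat = np.fft.fft(x)                  # forward FFT
    psd = np.abs(fhat) ** 2 / n           # power per frequency bin
    fhat = fhat * (psd > threshold)       # keep only high-power components
    clean = np.fft.ifft(fhat)             # back to the time domain
    return clean.real if to_real else clean
```

Assuming the second argument acts as a power threshold, a small value keeps more frequency components while a large value keeps only the dominant ones.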
# Set variable 'tanggal' as the index to prepare for grouping the data by month
tpk_harian_test = tpk_harian_test.set_index('tanggal')
# Make a new data frame, named tpk_harian_test_agg,
# which groups the data into months using the mean, with 'tanggal' as the index
tpk_harian_test_agg = tpk_harian_test.groupby(by=tpk_harian_test.index.month).mean()
# Reset the index on the new data frame and display it
tpk_harian_test_agg = tpk_harian_test_agg.reset_index()
tpk_harian_test_agg
| tanggal | all_available_room | room_total | used_room | tpk_online | tpk_online_denoised_fft | |
|---|---|---|---|---|---|---|
| 0 | 1 | 17361.016129 | 27941.919355 | 10580.903226 | 23.613467 | 23.584929 |
| 1 | 2 | 15538.517857 | 26191.535714 | 10653.017857 | 26.480454 | 26.423566 |
| 2 | 3 | 16422.633333 | 26036.833333 | 9614.200000 | 23.083330 | 23.261825 |
| 3 | 4 | 17021.983333 | 27078.533333 | 10056.550000 | 24.765484 | 24.696790 |
| 4 | 5 | 17185.810345 | 27388.155172 | 10202.344828 | 25.102572 | 25.034087 |
| 5 | 6 | 16729.866667 | 27310.683333 | 10580.816667 | 26.392109 | 26.431097 |
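Note that this section groups by `tpk_harian_test.index.month` while the covid and mobility sections below use `pd.Grouper(freq='M')`; over a single calendar year the two partition the data identically but label the result differently. A small check with toy daily data (not the competition files):

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2021-01-01', '2021-06-30', freq='D')
daily = pd.DataFrame({'v': np.arange(len(idx), dtype=float)}, index=idx)

# .index.month labels the groups with integers 1..6...
by_month = daily.groupby(daily.index.month).mean()
# ...while pd.Grouper(freq='M') labels them with month-end timestamps
by_grouper = daily.groupby(pd.Grouper(freq='M')).mean()

assert list(by_month.index) == [1, 2, 3, 4, 5, 6]
assert by_month['v'].tolist() == by_grouper['v'].tolist()
```

Over multiple years `.index.month` would pool the same month across years, so `pd.Grouper` is the safer default when the range can exceed twelve months.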
# Add a new column named 'Id' to the data frame tpk_harian_test_agg,
# with values ranging from 1 to 6,
# then display the data
tpk_harian_test_agg['Id'] = np.arange(start=1, stop=7, step=1)
tpk_harian_test_agg
| tanggal | all_available_room | room_total | used_room | tpk_online | tpk_online_denoised_fft | Id | |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 17361.016129 | 27941.919355 | 10580.903226 | 23.613467 | 23.584929 | 1 |
| 1 | 2 | 15538.517857 | 26191.535714 | 10653.017857 | 26.480454 | 26.423566 | 2 |
| 2 | 3 | 16422.633333 | 26036.833333 | 9614.200000 | 23.083330 | 23.261825 | 3 |
| 3 | 4 | 17021.983333 | 27078.533333 | 10056.550000 | 24.765484 | 24.696790 | 4 |
| 4 | 5 | 17185.810345 | 27388.155172 | 10202.344828 | 25.102572 | 25.034087 | 5 |
| 5 | 6 | 16729.866667 | 27310.683333 | 10580.816667 | 26.392109 | 26.431097 | 6 |
The following code was also run, with minor changes, for the variable covid_harian_aktif; since that variable is not significant, I chose not to repeat it here.
# Read the csv file
covid_harian_test = pd.read_csv('Datasets/test-covid_cases_bali_harian_2021.csv')
# Remove NaN values (dropna returns a new frame, so reassign it) and display the result
covid_harian_test = covid_harian_test.dropna()
covid_harian_test
| tanggal | covid_bali_harian | |
|---|---|---|
| 0 | 1/1/2021 | 101 |
| 1 | 1/2/2021 | 165 |
| 2 | 1/3/2021 | 119 |
| 3 | 1/4/2021 | 118 |
| 4 | 1/5/2021 | 167 |
| ... | ... | ... |
| 176 | 6/26/2021 | 246 |
| 177 | 6/27/2021 | 174 |
| 178 | 6/28/2021 | 212 |
| 179 | 6/29/2021 | 238 |
| 180 | 6/30/2021 | 221 |
181 rows × 2 columns
# Denoise the possible noisy data
covid_harian_test['covid_bali_denoised_fft'] = fft_denoiser(covid_harian_test['covid_bali_harian'], 1000, to_real=True)
covid_harian_test['covid_bali_denoised_fft']
0 114.767774
1 134.210358
2 115.509309
3 126.602990
4 153.986990
...
176 244.978381
177 155.491723
178 217.884304
179 241.567884
180 226.365453
Name: covid_bali_denoised_fft, Length: 181, dtype: float64
# Change the datatype of variable 'tanggal' into datetime (YEAR-MONTH-DATE)
# Display information on datatypes
covid_harian_test['tanggal'] = pd.to_datetime(covid_harian_test['tanggal'], format="%m/%d/%Y")
covid_harian_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 181 entries, 0 to 180
Data columns (total 3 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   tanggal                  181 non-null    datetime64[ns]
 1   covid_bali_harian        181 non-null    int64
 2   covid_bali_denoised_fft  181 non-null    float64
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 4.4 KB
# Set variable 'tanggal' as the index to prepare for grouping the data by month
covid_harian_test = covid_harian_test.set_index('tanggal')
# Make a new data frame, named covid_harian_test_agg,
# which groups the data into months using the mean, with 'tanggal' as the index
covid_harian_test_agg = covid_harian_test.groupby(pd.Grouper(freq='M')).mean()
# Reset the index on new data frame and show the datatypes
covid_harian_test_agg = covid_harian_test_agg.reset_index()
covid_harian_test_agg.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   tanggal                  6 non-null      datetime64[ns]
 1   covid_bali_harian        6 non-null      float64
 2   covid_bali_denoised_fft  6 non-null      float64
dtypes: datetime64[ns](1), float64(2)
memory usage: 272.0 bytes
# Display the new data frame
covid_harian_test_agg
| tanggal | covid_bali_harian | covid_bali_denoised_fft | |
|---|---|---|---|
| 0 | 2021-01-31 | 276.096774 | 276.014759 |
| 1 | 2021-02-28 | 287.964286 | 286.932277 |
| 2 | 2021-03-31 | 176.419355 | 177.099466 |
| 3 | 2021-04-30 | 166.233333 | 167.681506 |
| 4 | 2021-05-31 | 83.483871 | 81.325312 |
| 5 | 2021-06-30 | 98.600000 | 99.727514 |
# Add a new column named 'Id' to the data frame covid_harian_test_agg,
# with values ranging from 1 to 6,
# then display the data
covid_harian_test_agg['Id'] = np.arange(start=1, stop=7, step=1)
covid_harian_test_agg
| tanggal | covid_bali_harian | covid_bali_denoised_fft | Id | |
|---|---|---|---|---|
| 0 | 2021-01-31 | 276.096774 | 276.014759 | 1 |
| 1 | 2021-02-28 | 287.964286 | 286.932277 | 2 |
| 2 | 2021-03-31 | 176.419355 | 177.099466 | 3 |
| 3 | 2021-04-30 | 166.233333 | 167.681506 | 4 |
| 4 | 2021-05-31 | 83.483871 | 81.325312 | 5 |
| 5 | 2021-06-30 | 98.600000 | 99.727514 | 6 |
# Read the csv file
penerbangan_test = pd.read_csv('Datasets/test-penerbangan_2021.csv')
# Add a new column named 'Id' to the data frame penerbangan_test,
# with values ranging from 1 to 6,
# then display the data
penerbangan_test['Id'] = np.arange(start=1, stop=7, step=1)
penerbangan_test
| tanggal_ter | penerbangan | Id | |
|---|---|---|---|
| 0 | 1-Jan-21 | 119160 | 1 |
| 1 | 1-Feb-21 | 71122 | 2 |
| 2 | 1-Mar-21 | 117088 | 3 |
| 3 | 1-Apr-21 | 142329 | 4 |
| 4 | 1-May-21 | 121076 | 5 |
| 5 | 1-Jun-21 | 226287 | 6 |
# Read the csv file
wisatawan_test = pd.read_csv('Datasets/test-wisatawan_domestik_2021.csv')
# Add a new column named 'Id' to the data frame wisatawan_test,
# with values ranging from 1 to 6,
# then display the data
wisatawan_test['Id'] = np.arange(start=1, stop=7, step=1)
wisatawan_test
| tanggal_wis | wisatawan | Id | |
|---|---|---|---|
| 0 | 1-Jan-21 | 282248 | 1 |
| 1 | 1-Feb-21 | 240608 | 2 |
| 2 | 1-Mar-21 | 305579 | 3 |
| 3 | 1-Apr-21 | 330593 | 4 |
| 4 | 1-May-21 | 363959 | 5 |
| 5 | 1-Jun-21 | 498852 | 6 |
# Read the csv file
tpk_bps_arima_test = pd.read_csv('Datasets/test-TPK_Hotel_berbintang_2021_plus_des_2020.csv')
# Add a new column named 'Id' to the data frame tpk_bps_arima_test,
# with values ranging from 1 to 6,
# then display the data
tpk_bps_arima_test['Id'] = np.arange(start=1, stop=7, step=1)
tpk_bps_arima_test
| tanggal | TPK_arima | Id | |
|---|---|---|---|
| 0 | 1-Dec-20 | 19.00 | 1 |
| 1 | 1-Jan-21 | 14.55 | 2 |
| 2 | 1-Feb-21 | 7.99 | 3 |
| 3 | 1-Mar-21 | 8.56 | 4 |
| 4 | 1-Apr-21 | 9.01 | 5 |
| 5 | 1-May-21 | 8.65 | 6 |
# Read the csv file
hari_test = pd.read_csv('Datasets/test-hari_2021.csv')
# Add a new column named 'Id' to the data frame hari_test,
# with values ranging from 1 to 6,
# then display the data
hari_test['Id'] = np.arange(start=1, stop=7, step=1)
hari_test
| tanggal | hari | Id | |
|---|---|---|---|
| 0 | 1-Jan-21 | 31 | 1 |
| 1 | 1-Feb-21 | 29 | 2 |
| 2 | 1-Mar-21 | 31 | 3 |
| 3 | 1-Apr-21 | 30 | 4 |
| 4 | 1-May-21 | 31 | 5 |
| 5 | 1-Jun-21 | 30 | 6 |
# Read the csv file
covid_total_test = pd.read_csv('Datasets/test-covid_cases_bali_total_2021.csv')
# Add a new column named 'Id' to the data frame covid_total_test,
# with values ranging from 1 to 6,
# then display the data
covid_total_test['Id'] = np.arange(start=1, stop=7, step=1)
covid_total_test
| tanggal | covid_bali_total | Id | |
|---|---|---|---|
| 0 | 1/31/2021 | 26152 | 1 |
| 1 | 2/28/2021 | 34215 | 2 |
| 2 | 3/31/2021 | 39684 | 3 |
| 3 | 4/30/2021 | 44671 | 4 |
| 4 | 5/31/2021 | 47259 | 5 |
| 5 | 6/30/2021 | 50217 | 6 |
# Read the csv file
mobility_test = pd.read_csv('Datasets/test-google_mobility_2021.csv')
# Show information on data types
mobility_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 181 entries, 0 to 180
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   entity                 181 non-null    object
 1   code                   181 non-null    object
 2   day                    181 non-null    object
 3   retail_and_recreation  181 non-null    float64
 4   grocery_and_pharmacy   181 non-null    float64
 5   residential            181 non-null    float64
 6   transit_stations       181 non-null    float64
 7   parks                  181 non-null    float64
 8   workplaces             181 non-null    float64
dtypes: float64(6), object(3)
memory usage: 12.9+ KB
# Change the datatype of variable 'day' into datetime (YEAR-MONTH-DATE)
mobility_test['day'] = pd.to_datetime(mobility_test['day'], format="%m/%d/%Y")
mobility_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 181 entries, 0 to 180
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   entity                 181 non-null    object
 1   code                   181 non-null    object
 2   day                    181 non-null    datetime64[ns]
 3   retail_and_recreation  181 non-null    float64
 4   grocery_and_pharmacy   181 non-null    float64
 5   residential            181 non-null    float64
 6   transit_stations       181 non-null    float64
 7   parks                  181 non-null    float64
 8   workplaces             181 non-null    float64
dtypes: datetime64[ns](1), float64(6), object(2)
memory usage: 12.9+ KB
# Set variable 'day' as the index to prepare for grouping the data by month
mobility_test = mobility_test.set_index('day')
# Make a new data frame, named mobility_test_agg,
# which groups the data into months using the median, with 'day' as the index
mobility_test_agg = mobility_test.groupby(pd.Grouper(freq='M')).median()
# Reset the index on new data frame and show the datatypes
mobility_test_agg = mobility_test_agg.reset_index()
mobility_test_agg.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   day                    6 non-null      datetime64[ns]
 1   retail_and_recreation  6 non-null      float64
 2   grocery_and_pharmacy   6 non-null      float64
 3   residential            6 non-null      float64
 4   transit_stations       6 non-null      float64
 5   parks                  6 non-null      float64
 6   workplaces             6 non-null      float64
dtypes: datetime64[ns](1), float64(6)
memory usage: 464.0 bytes
# Display the new data frame
mobility_test_agg
| day | retail_and_recreation | grocery_and_pharmacy | residential | transit_stations | parks | workplaces | |
|---|---|---|---|---|---|---|---|
| 0 | 2021-01-31 | -24.2860 | -9.4290 | 11.000 | -38.4290 | -22.4290 | -29.0000 |
| 1 | 2021-02-28 | -22.9285 | -6.2860 | 7.500 | -36.7140 | -25.5715 | -27.4285 |
| 2 | 2021-03-31 | -16.8570 | 1.8570 | 5.429 | -30.5710 | -17.0000 | -23.8570 |
| 3 | 2021-04-30 | -14.3575 | 7.7140 | 6.214 | -26.4285 | -16.2855 | -22.5715 |
| 4 | 2021-05-31 | -2.8570 | 21.2860 | 6.429 | -24.7140 | 6.8570 | -23.8570 |
| 5 | 2021-06-30 | -1.2855 | 20.8575 | 6.286 | -19.6430 | 1.9290 | -19.1430 |
# Add a new column named 'Id' to the data frame mobility_test_agg,
# with values ranging from 1 to 6,
# then display the data
mobility_test_agg['Id'] = np.arange(start=1, stop=7, step=1)
mobility_test_agg
| day | retail_and_recreation | grocery_and_pharmacy | residential | transit_stations | parks | workplaces | Id | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2021-01-31 | -24.2860 | -9.4290 | 11.000 | -38.4290 | -22.4290 | -29.0000 | 1 |
| 1 | 2021-02-28 | -22.9285 | -6.2860 | 7.500 | -36.7140 | -25.5715 | -27.4285 | 2 |
| 2 | 2021-03-31 | -16.8570 | 1.8570 | 5.429 | -30.5710 | -17.0000 | -23.8570 | 3 |
| 3 | 2021-04-30 | -14.3575 | 7.7140 | 6.214 | -26.4285 | -16.2855 | -22.5715 | 4 |
| 4 | 2021-05-31 | -2.8570 | 21.2860 | 6.429 | -24.7140 | 6.8570 | -23.8570 | 5 |
| 5 | 2021-06-30 | -1.2855 | 20.8575 | 6.286 | -19.6430 | 1.9290 | -19.1430 | 6 |
# Join all the data frames (whether or not they are used in the final models)
# on the column 'Id' into a new data frame named tpk_join_test,
# then display the result
tpk_join_test = pd.merge(tpk_harian_test_agg,covid_harian_test_agg,on='Id',how='left')
tpk_join_test = pd.merge(tpk_join_test,penerbangan_test,on='Id',how='left')
tpk_join_test = pd.merge(tpk_join_test,wisatawan_test,on='Id',how='left')
tpk_join_test = pd.merge(tpk_join_test,tpk_bps_arima_test,on='Id',how='left')
tpk_join_test = pd.merge(tpk_join_test,hari_test,on='Id',how='left')
tpk_join_test = pd.merge(tpk_join_test,covid_total_test,on='Id',how='left')
tpk_join_test = pd.merge(tpk_join_test,mobility_test_agg,on='Id',how='left')
tpk_join_test
| tanggal_x | all_available_room | room_total | used_room | tpk_online | tpk_online_denoised_fft | Id | tanggal_y | covid_bali_harian | covid_bali_denoised_fft | ... | hari | tanggal | covid_bali_total | day | retail_and_recreation | grocery_and_pharmacy | residential | transit_stations | parks | workplaces | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 17361.016129 | 27941.919355 | 10580.903226 | 23.613467 | 23.584929 | 1 | 2021-01-31 | 276.096774 | 276.014759 | ... | 31 | 1/31/2021 | 26152 | 2021-01-31 | -24.2860 | -9.4290 | 11.000 | -38.4290 | -22.4290 | -29.0000 |
| 1 | 2 | 15538.517857 | 26191.535714 | 10653.017857 | 26.480454 | 26.423566 | 2 | 2021-02-28 | 287.964286 | 286.932277 | ... | 29 | 2/28/2021 | 34215 | 2021-02-28 | -22.9285 | -6.2860 | 7.500 | -36.7140 | -25.5715 | -27.4285 |
| 2 | 3 | 16422.633333 | 26036.833333 | 9614.200000 | 23.083330 | 23.261825 | 3 | 2021-03-31 | 176.419355 | 177.099466 | ... | 31 | 3/31/2021 | 39684 | 2021-03-31 | -16.8570 | 1.8570 | 5.429 | -30.5710 | -17.0000 | -23.8570 |
| 3 | 4 | 17021.983333 | 27078.533333 | 10056.550000 | 24.765484 | 24.696790 | 4 | 2021-04-30 | 166.233333 | 167.681506 | ... | 30 | 4/30/2021 | 44671 | 2021-04-30 | -14.3575 | 7.7140 | 6.214 | -26.4285 | -16.2855 | -22.5715 |
| 4 | 5 | 17185.810345 | 27388.155172 | 10202.344828 | 25.102572 | 25.034087 | 5 | 2021-05-31 | 83.483871 | 81.325312 | ... | 31 | 5/31/2021 | 47259 | 2021-05-31 | -2.8570 | 21.2860 | 6.429 | -24.7140 | 6.8570 | -23.8570 |
| 5 | 6 | 16729.866667 | 27310.683333 | 10580.816667 | 26.392109 | 26.431097 | 6 | 2021-06-30 | 98.600000 | 99.727514 | ... | 30 | 6/30/2021 | 50217 | 2021-06-30 | -1.2855 | 20.8575 | 6.286 | -19.6430 | 1.9290 | -19.1430 |
6 rows × 27 columns
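The chain of `pd.merge` calls above can be written once with `functools.reduce`; a minimal sketch with toy frames (illustrative values):

```python
import pandas as pd
from functools import reduce

# Toy frames sharing the 'Id' key (values are illustrative)
a = pd.DataFrame({'Id': [1, 2], 'tpk_online': [23.6, 26.5]})
b = pd.DataFrame({'Id': [1, 2], 'penerbangan': [119160, 71122]})
c = pd.DataFrame({'Id': [1, 2], 'wisatawan': [282248, 240608]})

# reduce() folds the list of frames into one left-joined frame
joined = reduce(lambda left, right: pd.merge(left, right, on='Id', how='left'),
                [a, b, c])
print(joined)
```

`pd.merge` also appends `_x`/`_y` suffixes when both frames share a non-key column, which is why tpk_join_test ends up with `tanggal_x` and `tanggal_y`.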
# Convert dataframe into numpy arrays
datatpk_test = tpk_join_test['tpk_online'].to_numpy().reshape(-1,1)
datacovid_test = tpk_join_test['covid_bali_harian'].to_numpy().reshape(-1,1)
dataterbang_test = tpk_join_test['penerbangan'].to_numpy().reshape(-1,1)
datawisata_test = tpk_join_test['wisatawan'].to_numpy().reshape(-1,1)
datatpkarima_test = tpk_join_test['TPK_arima'].to_numpy().reshape(-1,1)
datahari_test = tpk_join_test['hari'].to_numpy().reshape(-1,1)
datamobility_test = tpk_join_test['retail_and_recreation'].to_numpy().reshape(-1,1)
# Combine the chosen variables as DataX_test
dataX_test = np.hstack([datatpk_test,datawisata_test,dataterbang_test])
dataX_test
array([[2.36134668e+01, 2.82248000e+05, 1.19160000e+05],
[2.64804538e+01, 2.40608000e+05, 7.11220000e+04],
[2.30833299e+01, 3.05579000e+05, 1.17088000e+05],
[2.47654844e+01, 3.30593000e+05, 1.42329000e+05],
[2.51025722e+01, 3.63959000e+05, 1.21076000e+05],
[2.63921091e+01, 4.98852000e+05, 2.26287000e+05]])
# Read csv file
data_true = pd.read_csv('Datasets/sample-submission_webbali.csv')
# Get the true Y
tpk_true = data_true['TPK']
The following are the models I tried running on the data:
- Linear Regression
- Ridge Regression
- Random Forest Regressor
- Support Vector Regression (SVR)
- K-Nearest Neighbor Regressor
- MLPRegressor (Neural Network Regression)
- Lasso Regression
- Decision Tree Regressor
# Import model library
from sklearn.linear_model import LinearRegression
# Create model instance
linear = LinearRegression()
# Set model hyper parameters
parameters={
'normalize':[True],
'fit_intercept':[True]
}
# Make model pipeline
model = make_pipeline(StandardScaler(), GridSearchCV(linear,parameters,scoring='neg_mean_gamma_deviance',cv=12, verbose=1))
# Train model
model.fit(X=dataX, y=tpk_bps)
# Predict y
tpk_prediksi = model.predict(dataX)
# Display prediction in a graph
plt.figure(figsize=(12, 3))
plt.plot(tpk_online, label='TPK Online')
plt.plot(tpk_bps, label='TPK BPS')
plt.plot(tpk_prediksi, label='Prediksi dari Model')
# Add legend
plt.legend()
# Display graph
plt.show()
Fitting 12 folds for each of 1 candidates, totalling 12 fits
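One caveat about the pipeline above: placing StandardScaler outside GridSearchCV fits the scaler on all twelve months before cross-validation starts. Wrapping the whole pipeline inside the search scales each fold independently instead; a sketch on synthetic data (not the competition data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + 5.0

# The scaler lives inside the searched pipeline, so each CV split
# fits the scaler on its own training fold only
pipe = make_pipeline(StandardScaler(), LinearRegression())
search = GridSearchCV(pipe, {'linearregression__fit_intercept': [True, False]}, cv=3)
search.fit(X, y)
```

With only twelve training rows the difference is small here, but the inside-the-search ordering is the leakage-free pattern.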
# Visualize data in matplotlib
temp = pd.DataFrame()
temp['tpk_online'] = tpk_online
temp['tpk_bps'] = tpk_bps
temp['tpk_prediksi'] = tpk_prediksi
temp = temp.sort_values(by='tpk_online')
plt.figure(figsize=(12,3))
plt.plot(temp['tpk_online'], temp['tpk_bps'], 'bo')
plt.plot(temp['tpk_online'], temp['tpk_prediksi'])
[<matplotlib.lines.Line2D at 0x186cd7d45e0>]
# Calculate model accuracy
model_linear = make_pipeline(StandardScaler(), LinearRegression(fit_intercept=True, normalize=True))
score_linear = cross_val_score(model_linear, dataX, tpk_bps, cv=12, scoring='neg_mean_gamma_deviance', verbose=1)
np.mean(abs(score_linear))
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers. [Parallel(n_jobs=1)]: Done 12 out of 12 | elapsed: 0.0s finished
0.2546453866871174
## Predict y based on the X test set
tpk_prediksi = model.predict(dataX_test)
tpk_prediksi
array([ 8.11345865, 5.4562405 , 8.12035659, 9.48706019, 8.556209 ,
14.46835728])
# Calculate Root Mean Squared Error (RMSE, the scoring in Kaggle)
mse = mean_squared_error(tpk_true, tpk_prediksi)
rmse_linear = math.sqrt(mse)
rmse_linear
2.4039382223936108
# Check Scikit Learn Version
# print('The scikit-learn version is {}.'.format(sklearn.__version__))
# Calculate Mean Absolute Percentage Error (MAPE)
mape_linear = mean_absolute_percentage_error(tpk_true, tpk_prediksi)
mape_linear
0.20634507924049714
# Calculate Mean Absolute Error (MAE)
mae_linear = mean_absolute_error(tpk_true,tpk_prediksi)
mae_linear
2.2163862991446948
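As a sanity check on the metric pipeline, the three scores can be reproduced by hand from the true values (quoted in a comment in the SVR section below) and the linear model's predictions printed above; even with inputs rounded to two decimals the RMSE comes out at 2.40:

```python
import math
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([11.15, 8.99, 10.24, 10.09, 10.35, 16.68])
y_pred = np.array([8.11, 5.46, 8.12, 9.49, 8.56, 14.47])  # linear model, rounded

rmse = math.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)
mape = float(np.mean(np.abs((y_true - y_pred) / y_true)))  # manual MAPE

print(rmse, mae, mape)
```

The manual MAPE matches `mean_absolute_percentage_error`, which is convenient on sklearn versions older than 0.24 where that function does not exist yet.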
# Display Predicted Y and True Y in a graph
plt.figure(figsize=(12, 3))
plt.plot(datatpk_test, label='TPK Online Test')
plt.plot(tpk_true, label='TPK BPS True')
plt.plot(tpk_prediksi, label='TPK BPS Prediksi')
plt.legend()
plt.show()
# Make data frame based on prediction
hasil = pd.DataFrame()
hasil['Id'] = np.arange(1,7)
hasil['TPK'] = tpk_prediksi
# Save the result into csv file
hasil.to_csv('Results/hasilv999991-linear.csv', index=None)
# Import model library
from sklearn.linear_model import Ridge
# Create model instance
ridge = Ridge()
# Set model hyper parameters
parameters={
# 'alpha':[1e-15,1e-10,1e-8,1e-3,1e-2,1,5,10,20,30,35,40,45,50,55,100],
'alpha':[1e-15],
'normalize':[True],
'solver':['auto']
}
# Make model pipeline
model = make_pipeline(StandardScaler(), GridSearchCV(ridge,parameters,scoring='neg_mean_gamma_deviance',cv=12, verbose=1))
#Train model
model.fit(X=dataX, y=tpk_bps)
# Predict y
tpk_prediksi = model.predict(dataX)
# Display prediction in a graph
plt.figure(figsize=(12, 3))
plt.plot(tpk_online, label='TPK Online')
plt.plot(tpk_bps, label='TPK BPS')
plt.plot(tpk_prediksi, label='Prediksi dari Model')
plt.legend()
plt.show()
Fitting 12 folds for each of 1 candidates, totalling 12 fits
# Visualize data in matplotlib
temp = pd.DataFrame()
temp['tpk_online'] = tpk_online
temp['tpk_bps'] = tpk_bps
temp['tpk_prediksi'] = tpk_prediksi
temp = temp.sort_values(by='tpk_online')
plt.figure(figsize=(12,3))
plt.plot(temp['tpk_online'], temp['tpk_bps'], 'bo')
plt.plot(temp['tpk_online'], temp['tpk_prediksi'])
[<matplotlib.lines.Line2D at 0x186cd79e790>]
# Calculate model accuracy
model_ridge = make_pipeline(StandardScaler(), Ridge(alpha=1e-15, normalize=True, solver='auto'))
score_ridge = cross_val_score(model_ridge, dataX, tpk_bps, cv=12, scoring='neg_mean_gamma_deviance', verbose=1)
np.mean(abs(score_ridge))
0.2546453866871333
## Predict y based on the X test set
tpk_prediksi = model.predict(dataX_test)
tpk_prediksi
array([ 8.11345865, 5.4562405 , 8.12035659, 9.48706019, 8.556209 ,
14.46835728])
# Calculate Root Mean Squared Error (RMSE, the scoring in Kaggle)
mse = mean_squared_error(tpk_true, tpk_prediksi)
rmse_ridge = math.sqrt(mse)
rmse_ridge
2.403938222393565
# Calculate Mean Absolute Percentage Error (MAPE)
mape_ridge = mean_absolute_percentage_error(tpk_true, tpk_prediksi)
mape_ridge
0.2063450792404925
# Calculate Mean Absolute Error (MAE)
mae_ridge = mean_absolute_error(tpk_true,tpk_prediksi)
mae_ridge
2.216386299144638
# Display Predicted Y and True Y in a graph
plt.figure(figsize=(12, 3))
plt.plot(datatpk_test, label='TPK Online Test')
plt.plot(tpk_true, label='TPK BPS True')
plt.plot(tpk_prediksi, label='TPK BPS Prediksi')
plt.legend()
plt.show()
# Make data frame based on prediction
hasil = pd.DataFrame()
hasil['Id'] = np.arange(1,7)
hasil['TPK'] = tpk_prediksi
# Save the result into csv file
hasil.to_csv('Results/hasilv999991-ridge.csv', index=None)
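With alpha=1e-15 the Ridge penalty is effectively zero, which is why its RMSE, MAPE, and MAE match the plain linear regression above to many decimal places. A quick demonstration on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(12, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + 0.5

# A vanishingly small alpha applies essentially no shrinkage,
# so Ridge recovers the ordinary least-squares coefficients
ridge = Ridge(alpha=1e-15).fit(X, y)
ols = LinearRegression().fit(X, y)

assert np.allclose(ridge.coef_, ols.coef_, atol=1e-6)
```

To make Ridge behave differently from linear regression, the commented-out alpha grid above would need to be searched.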
# Import model library
from sklearn.ensemble import RandomForestRegressor
# Create model instance
rf = RandomForestRegressor()
# Set model hyper parameters
parameters={
'n_estimators':[1000],
'criterion' : ['mae'],
'bootstrap': [False],
'max_features': ['sqrt'],
'min_samples_leaf': [3],
'min_samples_split': [4]
}
# Make model pipeline
model = make_pipeline(StandardScaler(), GridSearchCV(rf,parameters,scoring='neg_mean_gamma_deviance',cv=12, verbose=1))
# Train model
model.fit(X=dataX, y=tpk_bps)
# Predict y
tpk_prediksi = model.predict(dataX)
# Display prediction in a graph
plt.figure(figsize=(12, 3))
plt.plot(tpk_online, label='TPK Online')
plt.plot(tpk_bps, label='TPK BPS')
plt.plot(tpk_prediksi, label='Prediksi dari Model')
plt.legend()
plt.show()
Fitting 12 folds for each of 1 candidates, totalling 12 fits
# Visualize data in matplotlib
temp = pd.DataFrame()
temp['tpk_online'] = tpk_online
temp['tpk_bps'] = tpk_bps
temp['tpk_prediksi'] = tpk_prediksi
temp = temp.sort_values(by='tpk_online')
plt.figure(figsize=(12,3))
plt.plot(temp['tpk_online'], temp['tpk_bps'], 'bo')
plt.plot(temp['tpk_online'], temp['tpk_prediksi'])
[<matplotlib.lines.Line2D at 0x186cd8577f0>]
# Calculate model accuracy
model_rf = make_pipeline(StandardScaler(), RandomForestRegressor(n_estimators=100, bootstrap=False, max_features='log2', min_samples_leaf=1, min_samples_split=2))
score_rf = cross_val_score(model_rf, dataX, tpk_bps, cv=12, scoring='neg_mean_gamma_deviance', verbose=1)
np.mean(abs(score_rf))
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers. [Parallel(n_jobs=1)]: Done 12 out of 12 | elapsed: 0.6s finished
0.36368934747693693
## Predict y based on the X test set
tpk_prediksi = model.predict(dataX_test)
tpk_prediksi
array([ 8.00439, 4.99845, 8.00439, 9.51249, 10.97663, 23.96895])
# Calculate Root Mean Squared Error (RMSE, the scoring in Kaggle)
mse = mean_squared_error(tpk_true, tpk_prediksi)
rmse_rf = math.sqrt(mse)
rmse_rf
3.756777448962474
# Calculate Mean Absolute Percentage Error (MAPE)
mape_rf = mean_absolute_percentage_error(tpk_true, tpk_prediksi)
mape_rf
0.24986748567255446
# Calculate Mean Absolute Error (MAE)
mae_rf = mean_absolute_error(tpk_true,tpk_prediksi)
mae_rf
2.977643333333317
# Display Predicted Y and True Y in a graph
plt.figure(figsize=(12, 3))
plt.plot(datatpk_test, label='TPK Online Test')
plt.plot(tpk_true, label='TPK BPS True')
plt.plot(tpk_prediksi, label='TPK BPS Prediksi')
plt.legend()
plt.show()
# Make data frame based on prediction
hasil = pd.DataFrame()
hasil['Id'] = np.arange(1,7)
hasil['TPK'] = tpk_prediksi
# Save the result into csv file
hasil.to_csv('Results/hasilv999991-rf.csv', index=None)
# Import model library
from sklearn.svm import SVR
# Create model instance
svr = SVR()
# Set model hyper parameters
parameters= {
'kernel' : ['linear'],
# 'degree' : [2], # only significant for poly and sigmoid
'gamma' : ['auto'],
'tol' : [1.5],
'C' : [1],
'epsilon' : [3],
'shrinking' : [False],
'cache_size' : [200],
'max_iter' : [100]
}
# Make model pipeline
model = make_pipeline(StandardScaler(),GridSearchCV(svr,parameters,scoring='neg_mean_gamma_deviance', cv=12, verbose=1))
# Train model
model.fit(X=dataX, y=tpk_bps)
# Predict y
tpk_prediksi = model.predict(dataX)
# Display prediction in a graph
plt.figure(figsize=(12, 3))
plt.plot(tpk_online, label='TPK Online')
plt.plot(tpk_bps, label='TPK BPS')
plt.plot(tpk_prediksi, label='Prediksi dari Model')
plt.legend()
plt.show()
Fitting 12 folds for each of 1 candidates, totalling 12 fits
# Visualize data in matplotlib
temp = pd.DataFrame()
temp['tpk_online'] = tpk_online
temp['tpk_bps'] = tpk_bps
temp['tpk_prediksi'] = tpk_prediksi
temp = temp.sort_values(by='tpk_online')
plt.figure(figsize=(12,3))
plt.plot(temp['tpk_online'], temp['tpk_bps'], 'bo')
plt.plot(temp['tpk_online'], temp['tpk_prediksi'])
[<matplotlib.lines.Line2D at 0x186cebaef10>]
# Calculate model accuracy
model_svr = make_pipeline(StandardScaler(), SVR(kernel='linear', C=1, gamma='auto', tol=1.5, epsilon=3, shrinking=False,
cache_size=200, max_iter=100))
score_svr = cross_val_score(model_svr, dataX, tpk_bps, cv=12, scoring='neg_mean_gamma_deviance', verbose=1)
np.mean(abs(score_svr))
0.32353938262930093
## Predict y based on the X test set
tpk_prediksi = model.predict(dataX_test)
tpk_prediksi
# true: 11.15 8.99 10.24 10.09 10.35 16.68
array([10.00895387, 8.13057591, 10.49184745, 11.27913083, 11.57598922,
15.95167003])
# Calculate Root Mean Squared Error (RMSE, the scoring in Kaggle)
mse = mean_squared_error(tpk_true, tpk_prediksi)
rmse_svr = math.sqrt(mse)
rmse_svr
0.961905177379139
# Calculate Mean Absolute Percentage Error (MAPE)
mape_svr = mean_absolute_percentage_error(tpk_true, tpk_prediksi)
mape_svr
0.08374976356222269
# Calculate Mean Absolute Error (MAE)
mae_svr = mean_absolute_error(tpk_true,tpk_prediksi)
mae_svr
0.8992946156370799
# Display Predicted Y and True Y in a graph
plt.figure(figsize=(12, 3))
plt.plot(datatpk_test, label='TPK Online Test')
plt.plot(tpk_true, label='TPK BPS True')
plt.plot(tpk_prediksi, label='TPK BPS Prediksi')
plt.legend()
plt.show()
# Make data frame based on prediction
hasil = pd.DataFrame()
hasil['Id'] = np.arange(1,7)
hasil['TPK'] = tpk_prediksi
# Save the result into csv file
hasil.to_csv('Results/hasilv999991-svr.csv', index=None)
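The epsilon=3 in the grid above is large relative to the TPK scale: residuals inside the epsilon-tube cost nothing, so the fit is allowed to sit loosely around the data. A synthetic illustration of the effect (not the competition data):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = np.linspace(0, 10, 40).reshape(-1, 1)
y = 2 * X.ravel() + rng.normal(0, 0.1, 40)

# Residuals smaller than epsilon are ignored, so a wide tube lets the
# regressor flatten the slope; a narrow tube tracks the data closely
wide = SVR(kernel='linear', epsilon=3).fit(X, y)
narrow = SVR(kernel='linear', epsilon=0.1).fit(X, y)

mse_wide = np.mean((wide.predict(X) - y) ** 2)
mse_narrow = np.mean((narrow.predict(X) - y) ** 2)
assert mse_narrow < mse_wide
```

On the competition data this extra tolerance apparently acted as useful regularization, since SVR ended up with the lowest test RMSE of the models tried.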
# Import model library
from sklearn.neighbors import KNeighborsRegressor
# Create model instance
knr = KNeighborsRegressor()
# Set model hyper parameters
parameters= {
'n_neighbors' : [2],
'weights' : ['distance'],
'algorithm' : ['auto']
}
# Make model pipeline
model = make_pipeline(StandardScaler(), GridSearchCV(knr,parameters,scoring='neg_mean_gamma_deviance', cv=12, verbose=1))
# Train model
model.fit(X=dataX, y=tpk_bps)
# Predict y
tpk_prediksi = model.predict(dataX)
# Display prediction in a graph
plt.figure(figsize=(12, 3))
plt.plot(tpk_online, label='TPK Online')
plt.plot(tpk_bps, label='TPK BPS')
plt.plot(tpk_prediksi, label='Prediksi dari Model')
plt.legend()
plt.show()
Fitting 12 folds for each of 1 candidates, totalling 12 fits
# Visualize data in matplotlib
temp = pd.DataFrame()
temp['tpk_online'] = tpk_online
temp['tpk_bps'] = tpk_bps
temp['tpk_prediksi'] = tpk_prediksi
temp = temp.sort_values(by='tpk_online')
plt.figure(figsize=(12,3))
plt.plot(temp['tpk_online'], temp['tpk_bps'], 'bo')
plt.plot(temp['tpk_online'], temp['tpk_prediksi'])
[<matplotlib.lines.Line2D at 0x186cece9d30>]
# Calculate model accuracy
model_knr = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=2, weights='distance', algorithm='auto'))
score_knr = cross_val_score(model_knr, dataX, tpk_bps, cv=12, scoring='neg_mean_gamma_deviance', verbose=1)
np.mean(abs(score_knr))
0.2851041652066483
## Predict y based on the X test set
tpk_prediksi = model.predict(dataX_test)
tpk_prediksi
array([11.27901183, 2.88688969, 12.08435434, 15.05259959, 15.75886452,
21.53990871])
# Calculate Root Mean Squared Error (RMSE, the scoring in Kaggle)
mse = mean_squared_error(tpk_true, tpk_prediksi)
rmse_knr = math.sqrt(mse)
rmse_knr
4.437870457219751
# Calculate Mean Absolute Percentage Error (MAPE)
mape_knr = mean_absolute_percentage_error(tpk_true, tpk_prediksi)
mape_knr
0.36272524485107
# Calculate Mean Absolute Error (MAE)
mae_knr = mean_absolute_error(tpk_true,tpk_prediksi)
mae_knr
3.8846415492685757
# Display Predicted Y and True Y in a graph
plt.figure(figsize=(12, 3))
plt.plot(datatpk_test, label='TPK Online Test')
plt.plot(tpk_true, label='TPK BPS True')
plt.plot(tpk_prediksi, label='TPK BPS Prediksi')
plt.legend()
plt.show()
# Make data frame based on prediction
hasil = pd.DataFrame()
hasil['Id'] = np.arange(1,7)
hasil['TPK'] = tpk_prediksi
# Save the result into csv file
hasil.to_csv('Results/hasilv999991-knr.csv', index=None)
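For intuition on the weights='distance' setting used above, the prediction is an inverse-distance-weighted average of the two nearest training targets; a tiny worked example:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 2.0, 3.0])

knr = KNeighborsRegressor(n_neighbors=2, weights='distance')
knr.fit(X, y)

# Query 1.25: neighbours are 1 (distance 0.25) and 2 (distance 0.75);
# weights 1/0.25 and 1/0.75 give (4*1 + 1.333*2) / 5.333 = 1.25
pred = knr.predict([[1.25]])
```

Note that with weights='distance' a query at a training point returns its training target exactly, so the in-sample plot above is not evidence of generalization.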
# Import model library
from sklearn.neural_network import MLPRegressor
# Create model instance
nnr = MLPRegressor()
# Set model hyper parameters
parameters= {
'random_state' : [None],
'hidden_layer_sizes' : [100]
}
# Make model pipeline
model = make_pipeline(StandardScaler(), GridSearchCV(nnr,parameters,scoring='neg_mean_gamma_deviance', cv=12, verbose=1))
# Train model
model.fit(X=dataX, y=tpk_bps)
# Predict y
tpk_prediksi = model.predict(dataX)
# Display prediction in a graph
plt.figure(figsize=(12, 3))
plt.plot(tpk_online, label='TPK Online')
plt.plot(tpk_bps, label='TPK BPS')
plt.plot(tpk_prediksi, label='Prediksi dari Model')
plt.legend()
plt.show()
Fitting 12 folds for each of 1 candidates, totalling 12 fits
c:\users\khans\appdata\local\programs\python\python38\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:614: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet. (warning repeated for each of the 12 folds)
# Visualize data in matplotlib
temp = pd.DataFrame()
temp['tpk_online'] = tpk_online
temp['tpk_bps'] = tpk_bps
temp['tpk_prediksi'] = tpk_prediksi
temp = temp.sort_values(by='tpk_online')
plt.figure(figsize=(12,3))
plt.plot(temp['tpk_online'], temp['tpk_bps'], 'bo')
plt.plot(temp['tpk_online'], temp['tpk_prediksi'])
[<matplotlib.lines.Line2D at 0x186cee64340>]
# Calculate model accuracy
model_nnr = make_pipeline(StandardScaler(), MLPRegressor(random_state=None, hidden_layer_sizes=100))
score_nnr = cross_val_score(model_nnr, dataX, tpk_bps, cv=12, scoring='neg_mean_gamma_deviance', verbose=1)
np.mean(abs(score_nnr))
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
c:\users\khans\appdata\local\programs\python\python38\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:614: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet. (warning repeated once per CV fold)
[Parallel(n_jobs=1)]: Done 12 out of 12 | elapsed: 0.5s finished
1.123895896360881
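The ConvergenceWarning above means the stochastic optimizer hit its default cap of 200 iterations before the loss stabilized, so the fitted network may be undertrained. One common remedy is simply raising `max_iter`. A minimal sketch on synthetic data (the shapes and values here are assumptions, not the competition data):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor

# Hypothetical stand-in for dataX / tpk_bps: 24 monthly rows, 3 predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(24, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=24)

# Raising max_iter (default 200) gives the stochastic optimizer room to converge
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=100, max_iter=5000, random_state=0),
)
model.fit(X, y)
print(model.score(X, y))  # in-sample R^2
```

Alternatively, `early_stopping=True` holds out a validation fraction and stops when that score plateaus, which avoids picking `max_iter` by hand.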
# Predict y based on the X test set
tpk_prediksi = model.predict(dataX_test)
tpk_prediksi
array([2.92602728, 2.31021197, 3.10940723, 2.98756554, 3.04524177,
4.82183295])
# Calculate Root Mean Squared Error (RMSE, the scoring in Kaggle)
mse = mean_squared_error(tpk_true, tpk_prediksi)
rmse_nnr = math.sqrt(mse)
rmse_nnr
8.24134740626539
# Calculate Mean Absolute Percentage Error (MAPE)
mape_nnr = mean_absolute_percentage_error(tpk_true, tpk_prediksi)
mape_nnr
0.7162584216991825
# Calculate Mean Absolute Error (MAE)
mae_nnr = mean_absolute_error(tpk_true,tpk_prediksi)
mae_nnr
8.049952208802386
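The three error metrics above are easy to cross-check by hand with NumPy, using the true values and the NNR predictions printed earlier in this section:

```python
import numpy as np

# True TPK BPS values and the NNR predictions from the cells above
y_true = np.array([11.15, 8.99, 10.24, 10.09, 10.35, 16.68])
y_pred = np.array([2.92602728, 2.31021197, 3.10940723,
                   2.98756554, 3.04524177, 4.82183295])

err = y_true - y_pred
rmse = np.sqrt(np.mean(err ** 2))      # Root Mean Squared Error
mae = np.mean(np.abs(err))             # Mean Absolute Error
mape = np.mean(np.abs(err / y_true))   # MAPE, as a fraction (0.72 = 72%)

print(rmse, mae, mape)
```

The hand-computed values reproduce `rmse_nnr`, `mae_nnr`, and `mape_nnr` exactly, confirming the scikit-learn calls are being used as intended.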
# Display Predicted Y and True Y in a graph
plt.figure(figsize=(12, 3))
plt.plot(datatpk_test, label='TPK Online Test')
plt.plot(tpk_true, label='TPK BPS True')
plt.plot(tpk_prediksi, label='TPK BPS Prediksi')
plt.legend()
plt.show()
# Make data frame based on prediction
hasil = pd.DataFrame()
hasil['Id'] = np.arange(1,7)
hasil['TPK'] = tpk_prediksi
# Save the result into csv file
hasil.to_csv('Results/hasilv999991-nnr.csv', index=None)
This model performs relatively best when I use more variables; however, its RMSE is still higher than that of SVR with just 3 independent variables.
# Import model library
from sklearn import linear_model
# Create model instance
lasso = linear_model.Lasso()
# Set model hyper parameters
parameters = {
    'alpha' : [0.1],
    'fit_intercept' : [False],
    'normalize' : [True],
    'precompute' : [True],
    'copy_X' : [False],
    'max_iter' : [2000],
    'tol' : [0.0002],
    'warm_start' : [True],
    'positive' : [True],
    'selection' : ['random'],
    # 'random_state' : [3]
}
# Make model pipeline
model = make_pipeline(StandardScaler(), GridSearchCV(lasso,parameters,scoring='neg_mean_gamma_deviance', cv=12, verbose=1))
# Train model
model.fit(X=dataX, y=tpk_bps)
# Predict y
tpk_prediksi = model.predict(dataX)
# Display prediction in a graph
plt.figure(figsize=(12, 3))
plt.plot(tpk_online, label='TPK Online')
plt.plot(tpk_bps, label='TPK BPS')
plt.plot(tpk_prediksi, label='Prediksi dari Model')
plt.legend()
plt.show()
Fitting 12 folds for each of 1 candidates, totalling 12 fits
# Visualize data in matplotlib
temp = pd.DataFrame()
temp['tpk_online'] = tpk_online
temp['tpk_bps'] = tpk_bps
temp['tpk_prediksi'] = tpk_prediksi
temp = temp.sort_values(by='tpk_online')
plt.figure(figsize=(12,3))
plt.plot(temp['tpk_online'], temp['tpk_bps'], 'bo')
plt.plot(temp['tpk_online'], temp['tpk_prediksi'])
[<matplotlib.lines.Line2D at 0x186cef33040>]
# Calculate model accuracy
model_lasso = make_pipeline(StandardScaler(), linear_model.Lasso(alpha=0.1, fit_intercept=True, normalize=True, precompute=True,
copy_X=False, max_iter=2000, tol=0.0002, warm_start=True,
positive=True, selection='random'))
score_lasso = cross_val_score(model_lasso, dataX, tpk_bps, cv=12, scoring='neg_mean_gamma_deviance')
np.mean(abs(score_lasso))
0.14586718836431398
# Predict y based on the X test set
tpk_prediksi = model.predict(dataX_test)
tpk_prediksi
# true: 11.15 8.99 10.24 10.09 10.35 16.68
# version 9997-lasso: array([11.49666859, 10.05485155, 10.38749577, 11.39712734, 9.21157405, 14.36640527])
array([ 8.14655524, 5.57529928, 8.16358282, 9.53057926, 8.65667923,
14.51061409])
# Calculate Root Mean Squared Error (RMSE, the scoring in Kaggle)
mse = mean_squared_error(tpk_true, tpk_prediksi)
rmse_lasso = math.sqrt(mse)
rmse_lasso
2.340900530111385
# Calculate Mean Absolute Percentage Error (MAPE)
mape_lasso = mean_absolute_percentage_error(tpk_true, tpk_prediksi)
mape_lasso
0.2001806085425233
# Calculate Mean Absolute Error (MAE)
mae_lasso = mean_absolute_error(tpk_true,tpk_prediksi)
mae_lasso
2.152781679686726
# Display Predicted Y and True Y in a graph
plt.figure(figsize=(12, 3))
plt.plot(datatpk_test, label='TPK Online Test')
plt.plot(tpk_true, label='TPK BPS True')
plt.plot(tpk_prediksi, label='TPK BPS Prediksi')
plt.legend()
plt.show()
# Make data frame based on prediction
hasil = pd.DataFrame()
hasil['Id'] = np.arange(1,7)
hasil['TPK'] = tpk_prediksi
# Save the result into csv file
hasil.to_csv('Results/hasilv999991-lasso.csv', index=None)
# Import model library
from sklearn.tree import DecisionTreeRegressor
# Create model instance
decision = DecisionTreeRegressor()
# Set model hyper parameters
parameters = {
    'random_state' : [None],
    'max_features' : ['auto']
}
# Make model pipeline
model = make_pipeline(StandardScaler(), GridSearchCV(decision,parameters,scoring='neg_mean_gamma_deviance', cv=12, verbose=1))
# Train model
model.fit(X=dataX, y=tpk_bps)
# Predict y
tpk_prediksi = model.predict(dataX)
# Display prediction in a graph
plt.figure(figsize=(12, 3))
plt.plot(tpk_online, label='TPK Online')
plt.plot(tpk_bps, label='TPK BPS')
plt.plot(tpk_prediksi, label='Prediksi dari Model')
plt.legend()
plt.show()
Fitting 12 folds for each of 1 candidates, totalling 12 fits
# Visualize data in matplotlib
temp = pd.DataFrame()
temp['tpk_online'] = tpk_online
temp['tpk_bps'] = tpk_bps
temp['tpk_prediksi'] = tpk_prediksi
temp = temp.sort_values(by='tpk_online')
plt.figure(figsize=(12,3))
plt.plot(temp['tpk_online'], temp['tpk_bps'], 'bo')
plt.plot(temp['tpk_online'], temp['tpk_prediksi'])
[<matplotlib.lines.Line2D at 0x186ceca8c70>]
# Calculate model accuracy
model_decision = make_pipeline(StandardScaler(), DecisionTreeRegressor(random_state=None, max_features='auto'))
score_decision = cross_val_score(model_decision, dataX, tpk_bps, cv=12, scoring='neg_mean_gamma_deviance', verbose=1)
np.mean(abs(score_decision))
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers. [Parallel(n_jobs=1)]: Done 12 out of 12 | elapsed: 0.0s finished
0.34598871863826136
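One caveat with the in-sample plot above: an unpruned decision tree typically memorizes the training set, so predictions on `dataX` (the same data it was fit on) can look deceptively perfect. A minimal sketch on synthetic data (shapes and seed are assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-in for dataX / tpk_bps: 24 rows, 3 predictors
rng = np.random.default_rng(1)
X = rng.normal(size=(24, 3))
y = rng.normal(size=24)  # pure noise, no real signal at all

tree = DecisionTreeRegressor(random_state=3).fit(X, y)
print(tree.score(X, y))  # in-sample R^2 is ~1.0 even on noise
```

This is why the cross-validation score and the held-out test metrics, not the training-set plot, are the fair basis for comparing the models. Note also that `random_state=None` lets the tree break split ties randomly, so reruns may differ; fixing the seed makes results reproducible.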
# Predict y based on the X test set
tpk_prediksi = model.predict(dataX_test)
tpk_prediksi
array([ 9.53, 3.68, 9.53, 9.32, 9.53, 25.41])
# Calculate Root Mean Squared Error (RMSE, the scoring in Kaggle)
mse = mean_squared_error(tpk_true, tpk_prediksi)
rmse_decision = math.sqrt(mse)
rmse_decision
4.2583799736519525
# Calculate Mean Absolute Percentage Error (MAPE)
mape_decision = mean_absolute_percentage_error(tpk_true, tpk_prediksi)
mape_decision
0.24736753859221494
# Calculate Mean Absolute Error (MAE)
mae_decision = mean_absolute_error(tpk_true,tpk_prediksi)
mae_decision
2.9933333333333336
# Display Predicted Y and True Y in a graph
plt.figure(figsize=(12, 3))
plt.plot(datatpk_test, label='TPK Online Test')
plt.plot(tpk_true, label='TPK BPS True')
plt.plot(tpk_prediksi, label='TPK BPS Prediksi')
plt.legend()
plt.show()
# Make data frame based on prediction
hasil = pd.DataFrame()
hasil['Id'] = np.arange(1,7)
hasil['TPK'] = tpk_prediksi
# Save the result into csv file
hasil.to_csv('Results/hasilv999991-decision.csv', index=None)
Based on the mean absolute error score
# Choose the best accuracy with cross validation score
list_model = ['linear', 'ridge', 'rf', 'svr', 'knr', 'nnr', 'lasso', 'decision']
score = [
np.mean(abs(score_linear)),
np.mean(abs(score_ridge)),
np.mean(abs(score_rf)),
np.mean(abs(score_svr)),
np.mean(abs(score_knr)),
np.mean(abs(score_nnr)),
np.mean(abs(score_lasso)),
np.mean(abs(score_decision))
]
list_model[np.argmin(score)]
'lasso'
# Choose the best RMSE
list_model_rmse = ['linear', 'ridge', 'rf', 'svr', 'knr', 'nnr', 'lasso', 'decision']
rmse = [
    rmse_linear,
    rmse_ridge,
    rmse_rf,
    rmse_svr,
    rmse_knr,
    rmse_nnr,
    rmse_lasso,
    rmse_decision
]
list_model_rmse[np.argmin(rmse)]
'svr'
# Choose the best MAPE
list_model_mape = ['linear', 'ridge', 'rf', 'svr', 'knr', 'nnr', 'lasso', 'decision']
mape = [
    mape_linear,
    mape_ridge,
    mape_rf,
    mape_svr,
    mape_knr,
    mape_nnr,
    mape_lasso,
    mape_decision
]
list_model_mape[np.argmin(mape)]
'svr'
# Choose the best MAE
list_model_mae = ['linear', 'ridge', 'rf', 'svr', 'knr', 'nnr', 'lasso', 'decision']
mae = [
    mae_linear,
    mae_ridge,
    mae_rf,
    mae_svr,
    mae_knr,
    mae_nnr,
    mae_lasso,
    mae_decision
]
list_model_mae[np.argmin(mae)]
'svr'
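The four selection blocks above can be condensed into one pandas table with `idxmin` per metric column. A sketch; the numbers below are illustrative placeholders (only nnr, lasso, and decision reflect values shown in this section), not the notebook's full results:

```python
import pandas as pd

models = ['linear', 'ridge', 'rf', 'svr', 'knr', 'nnr', 'lasso', 'decision']
# Placeholder metrics; in the notebook these come from score_* and rmse_*
metrics = {
    'cv_score': [0.50, 0.40, 0.30, 0.20, 0.60, 1.12, 0.146, 0.346],
    'rmse':     [5.00, 4.50, 4.00, 2.00, 6.00, 8.24, 2.34,  4.26],
}
tabel = pd.DataFrame(metrics, index=models)
# idxmin returns, for each column, the row label with the smallest value
print(tabel.idxmin())
```

One table makes it easy to see when the metrics disagree, as they do here: the CV score favors lasso while RMSE, MAPE, and MAE on the test months favor SVR.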
# Comparing RMSE with the original obtained by the instructor during the Workshop
math.sqrt(68.88720689917587)
8.299831739208685